Calibrating Simple Climate Models to Individual Earth System Models: Lessons Learned From Calibrating Hector

Simple climate models (SCMs) are computationally efficient and capable of emulating global mean output of more complex Earth system models (ESMs). In doing so, SCMs can play a critical role in climate research as stand‐ins for the computationally more expensive models, especially in studies involving low, spatial, and/or temporal resolution, providing more computationally efficient sources of climate data. Here we use Hector v2.5.0 to emulate the multiforcing historical and RCP scenario output for 31 concentration and seven emission‐driven ESMs. When calibrating Hector, sufficient calibration data must be used to constrain the model; otherwise, climate and/or carbon parameters affecting physical processes may be able to trade off with one another, allowing for solutions to use physically unreasonable fitted parameter values as well as limiting the application of the SCM as an emulator. We also present a novel methodology that uses the ESM range as a calibration data, which can be adopted when faced with missing variable output from a specific model.


Introduction
Simple climate models (SCMs) and Earth system models (ESMs) represent the carbon cycle and climate system at different scales. Because ESMs represent more physical and chemical processes in more detail and operate on fine temporal and spatial resolutions, they are computationally expensive and therefore can only be run for a limited length of time for a limited number of scenarios. Conversely, SCMs trade off spatial and temporal resolution to reduce computational expense by containing simplified representations of the major processes governing the carbon-climate system (Hartin et al., 2015;Meinshausen et al., 2011). Given that SCMs are computationally inexpensive to run, they can be used to generate large ensembles of results where high-resolution climate data are not needed for the problem at hand, such as many classes of uncertainty quantification studies (Clarke et al., 2014;Fawcett et al., 2015;Meinshausen et al., 2009). SCMs can also be used in model coupling exercises with economic models (Schultz & Kasting, 1997;Smith & Edmonds, 2006;van Vuuren et al., 2011). Additionally, SCMs can be tuned or calibrated to emulate the more complex ESMs. When acting as emulators, SCMs can be used to generate novel scenarios that were not conducted by the ESMs, improve our understanding of the components of more complex models, and examine the structural uncertainty of ESMs (Meinshausen et al., 2011).
There are a range of calibration procedures within the SCM community. The most common approach used to calibrate SCMs to individual ESMs is a least squares regression (Caldeira & Myhrvold, 2013;Geoffroy et al., 2013;Huntingford et al., 2009;Khabbazan & Held, 2019;Meinshausen et al., 2011). In a least squares regression, an optimization routine solves for the optimal parameter fits that minimize a goodness-of-fit metric between the SCM output data and the ESM calibration data for Earth system variables of interest. Mean or total square error are often the metrics used, and different goodness-of-fit metrics can yield different degrees of calibration success (Caldeira & Myhrvold, 2013;Huntingford et al., 2009;Khabbazan & Held, 2019). Khabbazan and Held (2019), Caldeira and Myhrvold (2013), and Huntingford et al. (2009) used only global mean temperature data in their calibration procedures. While this can produce a good value for the goodness-of-fit metric, Huntingford et al. (2009) also suggested that the calibration procedure must be able identify a unique solution, for which using temperature data alone may be insufficient. Other papers, such as Meinshausen et al. (2011), used both temperature and ocean heat flux as calibration data and were able to find unique calibration solutions for their SCM for concentration-driven ESM simulations.
In this paper we tested four SCM calibration procedures on Hector v 2.5.0. Our objective was to design a calibration protocol that successfully calibrates Hector act as an ESM emulator. To this end, we defined several criteria that indicate a satisfactory calibration. First, SCM results should capture the aggregate behavior of the ESM output. Second, the optimization routine should be able to identify a unique solution. Finally, the fitted parameter values should be physically reasonable. The uniqueness criterion is essential for the purpose here, which is to calibrate Hector to a specific ESM for future use in chains of coupled modeling experiments. If we were trying to calibrate Hector to historical observations (i.e., to the physical Earth system), we would want our calibration to capture the uncertainty in many Earth system variables. However, in calibrating Hector to act as an emulator for a specific ESM, a single set of parameters that allow Hector to reproduce all calibration data for that specific ESM is needed. When a unique solution is not identified, there are multiple equally good parameter fits. Supporting multiple model fits that are equally good is not only impractical but may limit the ability to use Hector as an emulator for novel scenarios.
The results from this paper provide best practices for calibrating Hector, which will help inform future Hector calibration exercises. The procedures and concepts described herein may also be relevant to other exercises in calibrating SCMs to ESMs, particularly when calibrating to an individual ESM with missing variables. We outline a procedure that incorporates the ESM multimodel range as additional calibration data when an individual ESM is missing output variable data for that are otherwise important for the calibration protocol.

Hector v2.5.0
Hector is an open-source, object-oriented, reduced-form global carbon-cycle climate model (Hartin et al., 2015). Hector operates on an annual time step taking in emissions of greenhouse gases, non-CO 2 gases, and aerosols. Hector converts emissions to concentrations where needed and calculates the global change in radiative forcing from which global mean temperature and other climate and carbon cycle variables are calculated. Hector v2.5.0 (Vega-Westhoff et al., 2019) incorporates ocean-atmosphere heat flux dynamics using a one-dimension diffusive ocean heat flux model, DOECLIM (Kriegler, 2005). The carbon-cycle component of Hector is divided into a one box atmosphere, a three box terrestrial system with vegetation, soil, and detritus, where net primary production is a function of atmospheric CO 2 and heterotrophic respiration is a function of global mean temperature and a four box ocean.
The model is driven with exogenous global emissions or atmospheric concentrations. In the emissionsdriven mode, Hector is driven with CO 2 , non-CO 2 greenhouse gases, and aerosol emissions as time series inputs. When driven in emissions mode, carbon-cycle feedbacks are active, such as CO 2 fertilization and heterotrophic respiration. Alternatively, Hector can be run in the concentration-driven mode, where instead of using CO 2 emissions, CO 2 concentrations are prescribed, whereby design the carbon cycle is not active. These two modes, or model configurations, are comparable to the ESM Coupled Model Intercomparison Project Phase 5 (CMIP5) emissions and concentration-driven runs (Taylor et al., 2012). From an emulation perspective, we calibrate to the two different kinds of runs separately, as if they were different models of different physical processes (Hourdin et al., 2017).
In this paper we calibrated both the concentration and emissions-driven versions of Hector. During the concentration-driven experiments, four Hector climate parameters were tuned: climate sensitivity S, ocean heat diffusivity , the aerosol forcing scaling factor a , and the volcanic forcing scaling factor v . S ( • C) 10.1029/2019EA000980 characterizes Hector's long-term temperature changes in response to changes in CO 2 concentrations, and (cm 2 s −1 ) is the coefficient in the diffusion equation that controls the rate at which heat diffuses into the deep ocean. The remaining climate parameters, a and v , are unitless scale factors used to adjust Hector's temperature sensitivity to aerosol and volcanic radiative forcing, respectively. Higher a and v values indicate more temperature sensitivity to aerosols or volcanic forcing, respectively. During the emissions-driven calibration experiments, three Hector carbon-cycle parameters were calibrated in addition to the four Hector climate parameters previously mentioned. The three carbon-cycle parameters are preindustrial CO 2 concentration C 0 (ppm v ), CO 2 fertilization factor (unitless), and heterotrophic respiration temperature sensitivity factor Q 10 (unitless). Although the preindustrial CO 2 concentration is well constrained in observations, preindustrial CO 2 concentrations vary by about 10 ppmv across individual emission-driven ESM runs (Friedlingstein et al., 2014). In order for Hector to accurately emulate the emissions-driven ESM CO 2 concentrations over time, Hector must capture the preindustrial CO 2 concentration of the ESM. Therefore, preindustrial CO 2 concentration was included as a calibration parameter.
Parameters within Hector's ocean carbon cycle are not directly calibrated in this exercise. Given the active marine carbonate chemistry within the ocean, there are numerous parameters that are more difficult to constrain based on more data availability. Instead of direct calibration, the ocean carbon cycle is indirectly captured via the other parameters as they change the climate and carbon cycle. The ocean carbon flux is a function of the atmospheric CO 2 concentrations and surface temperature, which vary based on the other parameters. As in Meinshausen et al. (2011), we strive to avoid overfitting to minimize the number of parameters being calibrated.

Calibration Data Processing
In this paper, Hector parameters were calibrated to output from individual ESMs in participating in the Coupled Model Intercomparison Project Phase 5 (CMIP5) (Taylor et al., 2012). CMIP5 results from the historical, rcp26, rcp45, rcp60, rcp85, esmHistorical, and esmrcp85 scenarios were used to produce the global mean temperature, atmospheric CO 2 concentration, and ocean heat flux ESM calibration data. First, CMIP5 outputs were annually and globally averaged to match the resolution of Hector results. Since a limited number of ESMs provided ocean heat flux data, we calculated ocean heat flux using six atmospheric variables: incoming shortwave and longwave radiation at the surface (S d and L d ), upward shortwave and longwave radiation at the surface (S u and L u ), and sensible and latent heat flux (F s and F l ). For each grid cell at each time step, total heat flux F T is equal to Mean annual ocean heat flux calculations were isolated to the ocean by using land masks for each model. Incomplete data were discarded to ensure consistent and comparable calibration across all models. Additionally, ESMs that did not have historical temperature results and temperature results for at least one future scenario were also excluded from the calibration experiments.

Calibration Experiments
We used the downhill simplex algorithm (Nelder & Mead, 1965) to minimize the distance (measured as mean squared error, MSE; also referred to as the goodness of fit) between standardized ESM and Hector output data. Standardizing the ESM and Hector data enables cross-comparison between different variables by preventing the optimal goodness of fit from being dominated by output from a single variable. We also adopted an equal weighting methodology to prevent scenarios with multiple ensemble members from having a greater influence on optimal goodness-of-fit. By minimizing the multiscenario, multivariable MSE average, the optimization routine ensured that all variable and scenario output was being fitted equally well.

Concentration-Driven Fits
Of the models that participated in CMIP5 (Taylor et al., 2012), we were able to gather concentration-driven temperature and ocean heat flux results from 31 ESMs (listed in Table 1). During the concentration-driven runs, the models were forced with prescribed atmospheric CO 2 concentrations, non-CO 2 emissions, and aerosols for the historical period (historical experiment) and four alternative future scenarios that span a range of end of century radiative forcing (rcp26, rcp45, rcp60, and rcp85). To illustrate the importance of appropriate constraint data, we performed two concentration-driven calibration experiments. The first (section 5.1) only used temperature data, whereas the second (section 5.2) used both temperature and ocean-heat-flux data.

Temperature-Only Fits
The first concentration-driven calibration experiment tuned Hector parameters only using ESM temperature to calibrate the four Hector climate parameters; this is a similar approach to one taken by Khabbazan and Held (2019) and Geoffroy et al. (2013). Hector does not model interannual variability, and therefore, some discrepancies between Hector and ESM output are expected. Hector's ability to capture the global mean temperature trend was fairly consistent across the 31 models. The average RMSE between Hector and each ESM was 0.31 ± 0.08 • C, where the "±" range indicates one standard deviation of the distribution of RMSE values across the 31 Hector fits. The best fit had a RMSE of 0.21 • C, and the worst had a value of 0.60 • C. Three of the 31 fits were selected and visualized in Figure 1 to illustrate a range of Hector's performance as an emulator. Visualizations for the remaining fits are provided in the supporting information. The fitted values of S, , a , and v for all 31 models are reported in Table 1.
Even though Hector successfully emulates ESM temperature (Figure 1), the calibration results are unsatisfactory because they have physically unreasonable parameter values. Knutti et al. (2017) and Stocker et al. (2013) estimate that S is unlikely (≤5% probability) to lie outside of the interval (1, 6) • C. Diffusion of heat into the ocean must be strictly positive, so we consider an additional constraint that ≥ 10 −6 . As indicated in Figure 2, not all of the fitted values of S and fall within or near these reasonable ranges. In fact, only 12 of the 31 temperature-only calibration results are reasonable parameter values for all parameters considered.
A primary issue is that the temperature-only calibrations fail to identify unique solutions. The inability to identify unique solutions limits the practical use of Hector as an emulator of an individual ESM for novel emission pathways. When the calibration protocol fails to produce unique solutions, it means the system is insufficiently constrained. This can result in equally good calibration fits that use both physically reasonable and physically unreasonable parameters for Hector. We explored a series of temperature-only concentration-driven calibration experiments in which S, v , and a were calibrated for the same three models (illustrated in Figure 1) at fixed values of . Figure 3 indicates equal goodness of fit for all prescribed values of . There is no defined minimum because the higher values of are compensated with higher calibrated values of S, which ultimately produce well-calibrated versions of Hector. Similar behavior of the tradeoff between S and was observed by Forest et al. (2002) and Sansó and Forest (2009).

Temperature-Heat Flux Fits
To address the problems of uniqueness illustrated in Figure 3, we explored an additional calibration experiment in which ocean heat flux was added as a constraint. This was inspired by Huntingford et al. (2009), who postulated that ocean heat uptake could be used in instances where global temperature data were an insufficient calibration constraint.
This calibration protocol requires ocean heat flux data for at least one future scenario. However, we were only able to calculate ocean heat flux for 18 of the 31 ESMs to which we were trying to calibrate. For these 18 ESMs, we were able to follow the calibration procedure described in section 4. In lieu of excluding the 13 ESMs without ocean heat flux values from the Hector calibration exercises, we approximated the ocean heat flux constraint based on the range of available CMIP5 ocean heat flux values. Doing this allows us to calibrate Hector to each of the remaining 13 ESMs that lack ocean heat flux data.
To approximate these missing ocean heat flux data, we assume that the missing ocean heat flux values would lie somewhere within the range of the other CMIP5 ocean heat flux output. We then do the following: For any set of Hector parameters (including ), Hector will output global mean temperature and global mean heat flux values. The temperature RMSE is computed as previously described. For ocean heat flux time series data, x, computed by Hector, we calculate the penalty function where x min and x max are set to the lowest and highest heat flux values observed in the available ESM ocean heat flux data, is set to a value of 0.15 to force the sides of the penalty function to be steep, and E 2 (x) is the Gaussian error function. ocean heat flux outside of the range [x min , x max ], the value of L(x) will be sufficiently high that the goodness of fit is necessarily poor, and the nonlinear optimization algorithm will not choose those parameter values. For values of ocean heat flux within the range, the value of L(x) is sufficiently low that the goodness of fit is dominated by temperature RMSE, with no preference given to any particular heat flux value within the range.
Adding the ocean heat flux constraint caused the fitted parameter values to change while having a negligible impact on Hector's emulated temperature. The difference in temperature output from Hector between the temperature-only and temperature-heat flux fits, measured as the RMSE, is 0.01 • C. Figure 5  . Results from temperature-heat flux calibration of the three sample models (CCSM4, GFDL-CM3, and MRI-CGCM3), at fixed values of κ. Panel (a) compares χ, which represents the calibration results goodness of fit to the ESM temperature and ocean heat flux results as κ is varied. There is a unique solution because there is a global minimum in the χ as κ is varied, indicated by the black asterisks at κ is equal to 3.9, 2.8, and 4.45, for CCSM4, GFDL-CM3, and MRI-CGCM3, respectively. It is now possible to identify unique solutions, because now κ effects the goodness of fit independently of S because of the effects κ has on the ocean heat flux goodness of fit.
By including heat flux constraints, the optimization routine is able to identify unique combinations of Hector parameters ( Figure 6) with more realistic ocean heat flux results as visualized in the supporting information.

Emissions-Driven Fits
Unlike the concentration-driven scenarios, the CMIP5 emissions-driven (esmHistorical and esmrcp85) runs were driven with CO 2 , other non-CO 2 greenhouse gases, and aerosol emissions. Only a handful of the CMIP5 generation ESMs had the ability to represent climate-carbon feedbacks (Taylor et al., 2012). As such, finding sufficient comparison data to use for calibrating Hector presented a challenge.
The emission-driven calibration required complete esmHistorical and esmrcp85 output for atmospheric CO 2 and temperature as well as esmrcp85 heat flux until year 2050. We were able to obtain a total of six complete sets of emissions-driven results that met these requirements. One additional model, MIROC-ESM, had insufficient esmrcp85 ocean heat flux output, so we used the CMIP5 range (Equation 2) for the ocean heat flux constraint. In total, Hector was calibrated to emulate seven different emissions-driven ESMs.
The emissions-driven calibration experiments calibrated a total of seven Hector parameters: four climate and three carbon-cycle parameters. The four climate parameters were the same parameters calibrated during the concentration experiments S, ocean heat diffusivity , the aerosol forcing scaling factor a , and the volcanic forcing scaling factor v . The three new carbon-cycle parameters are preindustrial CO 2 concentration C 0 (ppm v ), the CO 2 fertilization factor (unitless), and the heterotrophic respiration temperature sensitivity factor Q 10 (unitless).

Temperature-Heat Flux-CO 2 Fits
Although Hector is able to emulate emissions-driven ESM results fairly well, Hector overestimates historical temperature and underestimates future temperature for GFDL-ESM2G (Figure 7). Hector calibration fits for CanESM2, CESM1-BGC, and GFDL-ESM2G were selected for main text illustrations of Hector's ability to emulate temperature, atmospheric CO 2 , and future ocean heat flux (all Hector fits are visualized in the supporting information). The fitted values of S, , a , and v (Table 2) are comparable to the fitted values from the concentration-driven temperature-heat flux calibration (Table 1). . Emulated Hector atmospheric CO 2 (a-c), ocean heat flux (d-f), and temperature (g-i) for CanESM2, CESM1-BGC, and GFDL-ESM2G (blue) using the temperature-heat flux-CO 2 calibration protocol compared with ESM temperature data (gray). No historical ocean heat flux data were used in the calibration process. Note. The temp-heat flux-CO 2 calibration experiment used ESM temperature, future ocean heat flux, and atmospheric CO 2 to calibrate Hector, whereas the temp-heat flux-CO 2 penalty calibration experiment used the same comparison data but added a penalty to the goodness of fit based on the value of Q 10 . In the concentration-driven temperature-only calibration protocol, the optimization algorithm was unable to identify unique parameterizations of S and ; similarly, here the emissions-driven calibration was unable to identify unique parameterizations of and Q 10 , where controls the vegetation productivity response to CO 2 concentration while Q 10 controls the response of heterotrophic respiration to temperature. Since these parameters affect physical processes that are related to one another, it is hard for the optimization routine to identify a unique combination of these parameters to use in ESM emulation. Over a wide range of fixed values, the optimized goodness-of-fit value is fairly constant (Figure 8), because and Q 10 can trade off. As increases, the CO 2 fertilization effect increases, and Hector's terrestrial module responds to the increasing CO 2 concentrations with an increase in productivity. To offset the atmospheric CO 2 that is being used by the terrestrial ecosystems while maintaining the RCP atmospheric CO 2 concentration and global mean temperature, Q 10 must also increase.

Temperature-Heat Flux-CO 2 Penalty Fits
Further constraining these tradeoffs between Q 10 and would ideally be handled by introducing additional carbon-cycle-related ESM output data such as Net Primary Production (NPP), Net Ecosystem Exchange, terrestrial carbon flux, and carbon flux from heterotrophic respiration, in a similar fashion to our previous description of the concentration-driven calibration procedure (section 5.2). However, because of the limited availability of the emissions-driven scenario results (including unavailability for many ESMs of data for many of these variables), we instead elected to penalize the calibration fit with Q 10 parameter values that fell outside of empirically derived ranges as defined by the results of Todd-Brown et al. (2018) and Shao et al. (2013). The Q 10 parameter penalty function was equal to the negative logarithm of a truncated normal distribution with a lower bound of 0, a mean of 1.75, and a standard deviation of 0.4. The Q 10 parameter penalty function was designed to impact the goodness of fit, which is now the average of the temperature MSE, ocean heat flux MSE, and the Q 10 penalty. When the fitted value of the Q 10 parameter is physically unreasonable, the value returned by the Q 10 parameter penalty function is large, penalizing the goodness of fit. , α a (c), α v (d), C 0 (e), β (f), and Q 10 (g) for the two emissions-driven calibration experiments, where one uses a Q 10 parameter penalty (blue) and one does not (gray). The arrows indicate the direction of change in parameter values when the Q 10 penalty was incorporated into the calibration protocol.
After applying the Q 10 parameter penalty, the difference between the emulated temperature, ocean heat flux, and atmospheric CO 2 output between the two different emissions-driven calibration experiments was minor. Each RMSE values were 0.05 • C, 0.2 W m −2 , and 47 ppm v CO 2 . These differences are less than 5% of the end of the century output values. Applying the Q 10 parameter penalty had the most dramatic effect on the fitted values of Q 10 but also impacted the six other calibrated Hector parameter values (Figure 9). The Q 10 parameter penalty function narrowed the range where and Q 10 parameter values could compensate for one another without impacting the goodness-of-fit value. Although the goodness of fit is still somewhat constant between the fixed values of 0.0 to 0.2 for , when was larger than 0.1, the goodness of fit steadily gets worse as increases (Figure 10a). Furthermore, it appears that models require small values in order to have reasonable Q 10 values: Three of the fitted values of are less than 0.1 (Table 2). While values less than 0.1 are small, these values are not impossibly small (Jiang et al., 2020;Jones et al., 2018). This calibration protocol restricted the fitted values of Q 10 (Figure 10b) from 1.3 to 1.8, which is more consistent with Q 10 reported by Todd-Brown et al. (2018) and Shao et al. (2013) than the fitted values of Q 10 from the first emissions-driven calibration experiment. There is a range where, as β is varied, the quality of the fit changes because Q 10 is limited in its ability to compensate for the change in carbon fertilization.

Discussion and Conclusions
Calibrating a SCM to act as an emulator of an ESM is a nontrivial task. In order to act as a successful emulator, the emulator should be able to reproduce the characteristics of the comparison data utilizing physically reasonable parameter values. To successfully use Hector as a concentration-driven CMIP5 ESM emulator, Hector must be calibrated using a combination of temperature and ocean heat flux comparison data. Similar to the tradeoff between S and during the temperature-only calibration experiment, we identified two carbon cycle parameters, and Q 10 that were able to compensate for each other in the emissions-driven calibration experiments when insufficient comparison data were used. For both calibration exercises, we introduced data-based penalty functions (such as Equation 2 and the Q 10 parameter penalty) to constrain tradeoff parameters (in concentration driven exercises) and Q 10 (in emissions driven). The use of penalty functions is broadly applicable to other SCMs wishing to calibrate to individual ESMs with missing output variables. We presented penalty methods that relied on data from both ESM archives and the experimental literature, illustrating the broad usefulness of the technique. In the future, other penalty-based calibration exercises may be performed with Hector, particularly as CMIP6 model results become available. The inclusion of such penalties to calibrate to ESMs with missing variables must be balanced, however, with caution regarding overfitting.
If left unconstrained, SCM parameters can compensate for one another, making it difficult for nonlinear optimization routines to identify unique solutions. This finding has implications for other SCM calibration exercises. As we saw in the concentration-driven calibrations, insufficiently constrained calibration protocols can lead to unreasonable fitted parameter values, limiting the ability to use the SCM as an emulator to test and design new scenarios. Both the concentration-driven and emission-driven protocols depend on heat flux ESM output in order to identify unique fitted values of S and that use reasonable fitted parameter values. The key findings of this paper have implications for the broader climate modeling and scenario communities by illustrating the information necessary for successful SCM calibration. Although future SCM calibration exercises may be able to take advantage of new scenarios and ESM output from the Phase 6 of the Coupled Model Intercomparison Project (CMIP6), ensuring that sufficient comparison data that are used will continue to be crucial to SCM calibration. Furthermore, using the ESM output range as an additional