The International Land Model Benchmarking (ILAMB) System: Design, Theory, and Implementation
Abstract
The increasing complexity of Earth system models has inspired efforts to quantitatively assess model fidelity through rigorous comparison with best available measurements and observational data products. Earth system models exhibit a high degree of spread in predictions of land biogeochemistry, biogeophysics, and hydrology, which are sensitive to forcing from other model components. Based on insights from prior land model evaluation studies and community workshops, the authors developed an open source model benchmarking software package that generates graphical diagnostics and scores model performance in support of the International Land Model Benchmarking (ILAMB) project. Employing a suite of in situ, remote sensing, and reanalysis data sets, the ILAMB package performs comprehensive model assessment across a wide range of land variables and generates a hierarchical set of web pages containing statistical analyses and figures designed to provide the user with insights into strengths and weaknesses of multiple models or model versions. Described here are the benchmarking philosophy and mathematical methodology embodied in the most recent implementation of the ILAMB package. Comparison methods unique to a few specific data sets are presented, and guidelines for configuring an ILAMB analysis and interpreting resulting model performance scores are discussed. ILAMB is being adopted by modeling teams and centers during model development and for model intercomparison projects, and community engagement is sought for extending evaluation metrics and adding new observational data sets to the benchmarking framework.
1 Introduction
As Earth system models (ESMs) become increasingly complex and observational data volumes rapidly expand, there is a growing need for comprehensive and multifaceted evaluation of model fidelity. Process‐rich ESMs pose challenges to developers implementing new parameterizations or tuning process representations, and to the broader community seeking information about the skill of model predictions. Model developers and software engineers require a systematic means for evaluating changes in model results to ensure that developments improve the scientific performance of target process representations while not adversely affecting results in other, possibly less familiar, parts of the model. To advance understanding and predictability of terrestrial biogeochemical processes and their interactions with hydrology and climate under conditions of increasing atmospheric carbon dioxide, rigorous analysis methods, employing best available observational data, are required to objectively assess and constrain model predictions, inform model development, and identify needed measurements and field experiments (Hoffman et al., 2017). These needs motivated the International Land Model Benchmarking (ILAMB) project, whose goals are to
- develop internationally accepted benchmarks for land model performance by drawing upon international expertise and collaboration;
- promote the use of these benchmarks by the international community for model intercomparison and development;
- strengthen linkages among experimental, remote sensing, and climate modeling communities in the design of new model tests, benchmarks, and measurement programs; and
- support the design and development of a new, open source, benchmarking software system for use by the international community.
Three ILAMB workshops have been held—in Exeter, UK, in 2009; Irvine, California, USA, in 2011 (Luo et al., 2012); and Washington, DC, USA, in 2016 (Hoffman et al., 2017)—to engage the modeling, measurements, and remote sensing communities in the identification of observational data sets and the design of model evaluation metrics. In this way, community consensus was sought for the curation of observational data and the methodology of model evaluation and scoring, which are described below.
Recognition that the capacities of the terrestrial and marine biosphere to store anthropogenic carbon will weaken under climate warming (Cox et al., 2000; Denman et al., 2007; Friedlingstein et al., 2001; Fung et al., 2005; Mahowald et al., 2017; Moore et al., 2018; Randerson et al., 2015) and that uncertainties in carbon cycle feedbacks must be quantified and reduced to improve projections of future climate change (Arora et al., 2013; Ciais et al., 2013; Friedlingstein et al., 2006, 2014; Gregory et al., 2009; Hoffman et al., 2014) has inspired efforts to quantitatively evaluate model performance through comparison with in situ and remote sensing observations (Anav et al., 2013; Eyring et al., 2016). Multimodel simulation results from the third Coupled Model Intercomparison Project (CMIP3; Meehl et al., 2007) and fifth CMIP (CMIP5; Taylor et al., 2012), which informed the Intergovernmental Panel on Climate Change Fourth and Fifth Assessment Reports (AR4 and AR5), provided opportunities for developing and testing model evaluation diagnostics, formal metrics, and exploration of benchmarking concepts and techniques. Early work on coupled model evaluation and establishing formal metrics focused primarily on atmospheric variables (Gleckler et al., 2008; Reichler & Kim, 2008). Following the first two ILAMB workshops, the land modeling community began exploring standardized and comprehensive benchmarking for terrestrial carbon cycle models (Abramowitz, 2012; Anav et al., 2013; Blyth et al., 2011; Bouskill et al., 2014; Cadule et al., 2010; Dalmonech & Zaehle, 2013; Ghimire et al., 2016; Kelley et al., 2013; Piao et al., 2013). While some researchers define benchmarking as a series of model tests based on a predefined expected level of performance (Abramowitz, 2005; Best et al., 2015), most of the systematic benchmarking strategies explored by the land modeling community to date do not depend upon the establishment of an expected level of performance.
The ILAMB software package, hereafter referred to as ILAMB, shares some of the same goals as existing model diagnostic and evaluation tools, such as the Protocol for the Analysis of Land Surface models (PALS; Abramowitz, 2012), the Program for Climate Model Diagnosis and Intercomparison Metrics Package (Gleckler et al., 2016), the Earth System Model Evaluation Tool (ESMValTool; Eyring et al., 2016), the Land surface Verification Toolkit (LVT; Kumar et al., 2012), and a wide variety of often custom‐developed diagnostic packages in use at international modeling centers. Some of these tools provide model‐to‐model comparisons, a large collection of stand‐alone graphical diagnostics, or workflow infrastructure that allows one to regenerate analysis results from previously published studies but with new model outputs. In contrast, ILAMB was designed to compare multiple models or model versions with observations simultaneously, assess functional relationships between prognostic variables and one or more forcing variables through variable‐to‐variable comparisons (e.g., gross primary production vs. precipitation), and score model performance across a suite of metrics, variables, and data sets. Model performance is evaluated for variables in categories of biogeochemistry (Table 2), hydrology (Table 3), radiation and energy (Table 4), and climate forcing (Table 5).
For every variable, ILAMB generates graphical diagnostics (spatial contour maps, time series line plots, and Taylor diagrams; Taylor, 2001) and scores model performance for the period mean, bias, root‐mean‐square error (RMSE), spatial distribution, interannual coefficient of variation, seasonal cycle, and long‐term trend. Model performance scores are calculated for each metric and variable and are scaled based on the degree of certainty of the observational data set, the scale appropriateness, and the overall importance of the constraint or process to model predictions, following a customizable rubric described below (Table 1). Scores are aggregated across metrics and data sets, producing a single scalar score for each variable for every model or model version. As shown in Figure 1, these scalar scores are presented graphically. On the left side we use a stoplight color scheme to indicate aggregate performance for each model by variable. On the right, we show relative performance (i.e., Z score), indicating which models or model versions perform better with respect to others contained in the overall analysis.
Table 1. ILAMB Scoring Rubric for Observational Data Sets

| Score | Certainty | Scale | Process |
|---|---|---|---|
| 1 | No given uncertainty, significant methodological issues affecting quality | Site level observations with limited space/time coverage | Observations that have limited influence on the targeted Earth system dynamics |
| 2 | No given uncertainty, some methodological issues affecting quality | Partial regional coverage, up to 1 year | Observations have direct influence on the targeted Earth system dynamics |
| 3 | No given uncertainty, methodology has some peer review | Regional coverage, at least 1 year | Observations useful to constrain processes that contribute to the targeted Earth system dynamics |
| 4 | Qualitative uncertainty, methodology accepted | Important regional coverage, at least 1 year | Observations well suited to constrain important processes |
| 5 | Well‐defined and relatively low uncertainty | Global scale spanning multiple years | Observations well suited for discriminating critical processes among models |
- Note. A score for each data set is assigned in each of three areas. These scores are then combined multiplicatively and used to determine relative importance for a data set with respect to a given variable. ILAMB = International Land Model Benchmarking.

We do not view these aggregate absolute scores as a determinant of good or bad models. Rather, we envision the scores as a tool for quickly identifying relative differences among models and model versions, which the scientist must then interpret. As in any evaluation methodology, many of our choices are subjective and must be kept in mind as the scores are interpreted. Where possible, the ILAMB implementation allows users to customize weights and diagnostics in order to incorporate aspects of model performance relevant to their scientific goals. ILAMB may be thought of as a framework that can be expanded to incorporate community ideas regarding model benchmarking. Thus, while our choices are subjective, they are informed by the preferences of a larger community and can be considered an initial suggestion.
The remainder of this paper describes the ILAMB methodology used to compute aggregate absolute scores. First we describe how we compare an individual observational data set to model output (section 2). Then we explain how scores are aggregated across data sets for each variable and present the data sets used in the land model evaluation (section 3). In section 4 we present some salient points about how the ILAMB software is designed. Finally, in section 5 we discuss what ILAMB scores mean and how they should be used.
2 Methodology
In this section we describe the methodology used to assess how well a model captures information contained in a reference (e.g., observational) data set. For the purposes of this section, we discuss the analysis of a generalized variable $v(t,\mathbf{x})$, which we assume is a piecewise‐constant function in space and time. This means that the temporal domain, represented by the variable t, is defined by the beginnings and endings of time intervals, and the spatial domain, represented by the variable $\mathbf{x}$ (in bold to emphasize that it is a vector quantity), consists of the areas created by cell boundaries or the areas associated with data sites. When necessary, we use the subscript ref to denote a variable whose source is a reference or observational data set, and the subscript mod for model data sets.
While many statistical quantities may be computed, the goal of our initial methodology is to examine the mean state and variability around the mean over monthly to decadal time scales and grid cell to global spatial scales. While we intend to uniformly apply this analysis procedure to all variables, we also implement a mechanism to skip certain aspects when deemed inappropriate. For example, if a reference data set only contains average information across a span of years, the annual cycle is undefined and automatically skipped in our implementation. The implementation also allows users to skip aspects of the analysis that are deemed inappropriate even if it is possible to compute these metrics using the available data. For example, the interannual variability may be poorly characterized in a reference data set even though the quantity could be computed.
2.1 Preliminary Definitions
Before presenting the specifics of the ILAMB methodology, we first present some definitions used throughout the paper. While the following definitions are widely used in the community, there are many subtle choices in their implementation that affect the interpretation of the results. We present them here with precise meanings to emphasize where a choice has been made and our reasoning for making it.
2.1.1 Mean Values Over Time
To obtain the mean value of the variable over a time period, we integrate over the temporal domain and divide by its measure,

$$\overline{v}(\mathbf{x}) = \frac{1}{t_f - t_0} \int_{t_0}^{t_f} v(t,\mathbf{x})\,\mathrm{d}t \qquad (1)$$

Because v is assumed piecewise constant in time, the integral reduces to a sum over the n time intervals,

$$\int_{t_0}^{t_f} v(t,\mathbf{x})\,\mathrm{d}t = \sum_{i=1}^{n} v(t_i,\mathbf{x})\,\left(t_{i,f} - t_{i,0}\right) \qquad (2)$$

where $t_{i,0}$ and $t_{i,f}$ are the initial and final times of each time interval. The average value is obtained by dividing through by the amount of time in the interval, $t_f - t_0$, replaced in our discrete approximation by the following function:

$$T(\mathbf{x}) = \sum_{i=1}^{n} \begin{cases} t_{i,f} - t_{i,0} & \text{if } v(t_i,\mathbf{x}) \text{ is valid} \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

In words, equation 3 addresses temporally discontinuous data by summing the time step interval sizes only where the corresponding variable data are marked as valid. This means that if a function has some values masked or marked as invalid at some locations, we do not penalize the averaged value by including this as a time at which a value is expected. If an integral (or sum) is desired instead of an average, then we simply omit the division by $T(\mathbf{x})$.
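To make equations 1–3 concrete, the following is a minimal sketch (ours, not the package's interface) that computes the masked temporal mean of a piecewise‐constant time series with NumPy:

```python
import numpy as np

def temporal_mean(v, t_bounds, valid):
    """Masked temporal mean of a piecewise-constant series (equations 1-3).

    v        : (n_time, n_space) array of values
    t_bounds : (n_time, 2) array of [initial, final] times per interval
    valid    : (n_time, n_space) boolean mask, True where v is valid
    """
    dt = t_bounds[:, 1] - t_bounds[:, 0]                             # interval sizes
    integral = np.sum(np.where(valid, v, 0.0) * dt[:, None], axis=0)  # equation 2
    T = np.sum(valid * dt[:, None], axis=0)                          # equation 3
    return integral / T                  # equation 1; NaN where never valid

# Example: two years of monthly data at 2 cells, one cell missing its first year
t = np.arange(25.0) * 30.0
bounds = np.column_stack([t[:-1], t[1:]])
v = np.ones((24, 2))
mask = np.ones((24, 2), dtype=bool)
mask[:12, 1] = False                     # first year invalid at the second cell
print(temporal_mean(v, bounds, mask))    # [1. 1.]
```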
2.1.2 Mean Values Over Space
Analogous to the temporal mean, the mean value over a spatial domain $\Omega$ is obtained by integrating over the domain and normalizing,

$$\overline{v}(t) = \frac{\displaystyle\int_{\Omega} w(\mathbf{x})\, v(t,\mathbf{x})\,\mathrm{d}\Omega}{\displaystyle\int_{\Omega} w(\mathbf{x})\,\mathrm{d}\Omega} \qquad (4)$$

where $w(\mathbf{x})$ is an optional weighting function. As v is assumed piecewise constant in space, the integrals reduce to sums over the cell areas,

$$\int_{\Omega} v(t,\mathbf{x})\,\mathrm{d}\Omega = \sum_{j=1}^{m} v(t,\mathbf{x}_j)\, a_j \qquad (5)$$

where $a_j$ is the area of cell j and $A(\Omega) = \sum_j a_j$ is the total area over which valid values are reported. Note that if no weighting is required, this is a normalization by the sum of the area over which we integrate. As with the temporal mean, if an integral only is required, we simply omit the division by $A(\Omega)$. In cases where a mean over a collection of sites is needed, the spatial integral reduces to an arithmetic mean across the sites.
When comparing a reference data set to model output, the spatial grids seldom coincide. We therefore form a composed grid from the cell breaks of both sources. If the reference grid is defined by its vectors of cell breaks,

$$\Omega_{\mathrm{ref}} = \left\{ \mathbf{x}_{\mathrm{ref}} \right\} \qquad (6)$$

and the model grid by

$$\Omega_{\mathrm{mod}} = \left\{ \mathbf{x}_{\mathrm{mod}} \right\} \qquad (7)$$

then the composed grid is defined by the union of the two sets of cell breaks,

$$\Omega_{\mathrm{comp}} = \left\{ \mathbf{x}_{\mathrm{ref}} \cup \mathbf{x}_{\mathrm{mod}} \right\} \qquad (8)$$

Once constructed, quantities defined on both $\Omega_{\mathrm{ref}}$ and $\Omega_{\mathrm{mod}}$ may be interpolated to $\Omega_{\mathrm{comp}}$ by nearest neighbor interpolation with zero interpolation error due to the nested nature of the grids. This can be seen visually by comparing the three plots shown in Figure 2a. In each plot, the tick marks along the x axis represent the cell breaks of the particular one‐dimensional grid, left coarse for illustration. The cyan curve represents a step function defined on the grid of a reference data set, $\Omega_{\mathrm{ref}}$, and the magenta curve one defined on that of the model data set, $\Omega_{\mathrm{mod}}$. Both are interpolated to the composed grid $\Omega_{\mathrm{comp}}$ without loss of information, albeit on a new grid containing more cells of variable size. Once on a composed grid, the quantities may be compared directly. As the ILAMB methodology has been envisioned for comparisons with model output from CMIP5, we have made an implicit assumption that each source grid, $\Omega_{\mathrm{ref}}$ and $\Omega_{\mathrm{mod}}$, is regular and can be represented by one‐dimensional vectors. While the implementation does provide naive interpolation for nonregular grids, the user is encouraged to employ a conservative interpolation scheme of their choosing prior to applying the ILAMB methodology.
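A minimal sketch (again ours, not the package's interface) of composing two one‐dimensional grids and transferring piecewise‐constant values onto the composed grid without interpolation error:

```python
import numpy as np

def compose_breaks(breaks_ref, breaks_mod):
    """Union of cell breaks from two 1-D grids (equations 6-8)."""
    lo = max(breaks_ref[0], breaks_mod[0])     # restrict to the overlapping extent
    hi = min(breaks_ref[-1], breaks_mod[-1])
    breaks = np.union1d(breaks_ref, breaks_mod)
    return breaks[(breaks >= lo) & (breaks <= hi)]

def to_composed(values, breaks_src, breaks_comp):
    """Transfer cell values to the composed grid.

    Every composed cell lies inside exactly one source cell, so this
    nearest neighbor lookup introduces zero interpolation error.
    """
    centers = 0.5 * (breaks_comp[:-1] + breaks_comp[1:])
    idx = np.digitize(centers, breaks_src) - 1
    return values[np.clip(idx, 0, len(values) - 1)]

# Example: a fine reference grid and a coarse model grid on [0, 10]
ref_breaks = np.linspace(0.0, 10.0, 11)       # 10 cells
mod_breaks = np.linspace(0.0, 10.0, 5)        # 4 cells
comp = compose_breaks(ref_breaks, mod_breaks)
v_mod = np.arange(4, dtype=float)             # piecewise-constant model values
v_on_comp = to_composed(v_mod, mod_breaks, comp)
```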

and
both interpolated to a composite grid
using nearest neighbor interpolation with zero interpolation error. The vertical grid lines reflect the cell boundaries in each grid. (b) Differences in the representation of land from a reference and model data set zoomed into Central America for emphasis. The red region represents where both sources are in agreement, the blue is land for the model but not the reference, and the green is land for the reference but not the model.
In addition to resolution differences, we observe that data sources vary in the underlying representation of the distinction between land and water. We illustrate this concept in Figure 2b, where we compare a fine scale representation of land from a reference data set to the relatively coarse representation of a model. This is a typical situation encountered when comparing high‐resolution observational data to lower‐resolution model output. The red region represents the intersection of the two land areas, that is, where both sources report the presence of land. However, there are missed land areas from both sources, represented by the blue and green colors. As much of the disagreement over what is considered land occurs around islands in tropical regions (e.g., Central America and Equatorial Asia), these nonrepresented areas can constitute a nontrivial percentage of the total represented variable v.
For transparency, the ILAMB implementation is built with the capability of reporting integrals over each of these three land areas. Unless specifically stated otherwise, when spatially integrating a quantity from a single source, we use the original grid and land areas given by that source, in order to remain as true as possible to the original intent of the provider. However, when comparing two data sources of varying resolution and land representation, we perform this integration over the area that both report to be land (the red area in Figure 2b).
2.1.3 Computing Normalized Scores From Errors
Many aspects of the analysis that follows produce measures of relative error, generically denoted here by $\varepsilon$. To convert these error measures to a unitless score on the interval [0, 1], we use an exponential mapping,

$$s = e^{-\alpha \varepsilon} \qquad (9)$$

where $\alpha$ is a parameter that controls how severely a given relative error is penalized. If we desire a relative error of $\hat{\varepsilon}$ to equate to a score of $\hat{s}$, then

$$\alpha = -\frac{\ln \hat{s}}{\hat{\varepsilon}} \qquad (10)$$

In Figure 3 we plot this function with two choices for $\alpha$, which illustrates how the relative error may be controlled. Unless stated otherwise, we use an implicit $\alpha = 1$ throughout the manuscript.

2.2 Mean State Analysis
In this section, we describe the various metrics and plots that our methodology generates. While presented in terms of the abstract variable v, we also include sample plots of a comparison of the GBAF (Jung et al., 2010) gross primary productivity (GPP) with CLM4.5 (Oleson et al., 2013) for the purpose of illustration. In practice, ILAMB produces thousands of such plots and scalars, which are browsable in a website designed to aid modelers in understanding the benchmarking results.
2.2.1 Bias
We begin by computing the period mean of the reference, $\overline{v}_{\mathrm{ref}}(\mathbf{x})$, over the time period of the reference, as well as that of the model, $\overline{v}_{\mathrm{mod}}(\mathbf{x})$, over the same time period, using equation 1. These are spatial variables that are included in the standard output as plots, as shown in Figures 4a and 4b. We also compute the bias,

$$\mathrm{bias}(\mathbf{x}) = \overline{v}_{\mathrm{mod}}(\mathbf{x}) - \overline{v}_{\mathrm{ref}}(\mathbf{x}) \qquad (11)$$

shown in Figure 4c. To score the bias, we need to nondimensionalize it as a relative error. We have chosen to do this by using the centralized root‐mean‐square (RMS) of the reference data,

$$\mathrm{crms}_{\mathrm{ref}}(\mathbf{x}) = \sqrt{\frac{1}{T(\mathbf{x})} \int_{t_0}^{t_f} \left( v_{\mathrm{ref}}(t,\mathbf{x}) - \overline{v}_{\mathrm{ref}}(\mathbf{x}) \right)^2 \mathrm{d}t} \qquad (12)$$

yielding a nondimensional pointwise relative error,

$$\varepsilon_{\mathrm{bias}}(\mathbf{x}) = \frac{\left| \mathrm{bias}(\mathbf{x}) \right|}{\mathrm{crms}_{\mathrm{ref}}(\mathbf{x})} \qquad (13)$$

which equation 9 maps to a score at each point,

$$s_{\mathrm{bias}}(\mathbf{x}) = e^{-\varepsilon_{\mathrm{bias}}(\mathbf{x})} \qquad (14)$$

The scalar bias score is then the spatial mean of this score map, equation 4, optionally weighted by the period mean of the reference,

$$S_{\mathrm{bias}} = \overline{s_{\mathrm{bias}}}, \quad w(\mathbf{x}) = \overline{v}_{\mathrm{ref}}(\mathbf{x}) \qquad (15)$$

We refer to this weighting as mass weighting, even though in some instances the variable is truly a mass, but other times a flux or rate. The main motivation is to weight areas where the variable is active. So while, in our conceptual example, there is large relative error in GPP over deserts, these values will not negatively contribute to the overall score because the value of GPP is low in this area.

; (b) model period mean,
; (c) bias, bias(x); (d) bias score, sbias(x).
We apply mass weighting when the variable v represents a mass or flux of carbon or water as in GPP or precipitation. For variables representing energy states or quantities, such as temperature and radiation, we omit the weighting and perform a spatial integral only. We report plots of the bias and its score as well as the scalar integrated mean values.
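A condensed sketch of the bias scoring chain (equations 11–15), assuming complete (unmasked) gridded data on a shared composed grid and a reference with nonzero variability; function and argument names are ours:

```python
import numpy as np

def bias_score(v_ref, v_mod, dt, area, mass_weight=True):
    """Scalar bias score of a (n_time, n_space) model field against a reference.

    dt   : (n_time,) time interval sizes
    area : (n_space,) cell areas
    """
    T = dt.sum()
    ref_mean = (v_ref * dt[:, None]).sum(axis=0) / T                 # equation 1
    mod_mean = (v_mod * dt[:, None]).sum(axis=0) / T
    bias = mod_mean - ref_mean                                       # equation 11
    crms = np.sqrt(((v_ref - ref_mean) ** 2 * dt[:, None]).sum(axis=0) / T)  # eq 12
    s = np.exp(-np.abs(bias) / crms)                                 # equations 13-14
    # Mass weighting by the (assumed nonnegative) reference period mean
    w = area * (np.abs(ref_mean) if mass_weight else 1.0)
    return (s * w).sum() / w.sum()                                   # equation 15
```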
2.2.2 RMSE
Because the bias only measures how well the model captures the mean behavior of a variable, we also compute the root‐mean‐square error of the model with respect to the reference,

$$\mathrm{rmse}(\mathbf{x}) = \sqrt{\frac{1}{T(\mathbf{x})} \int_{t_0}^{t_f} \left( v_{\mathrm{mod}}(t,\mathbf{x}) - v_{\mathrm{ref}}(t,\mathbf{x}) \right)^2 \mathrm{d}t} \qquad (16)$$

which we include in the standard output (Figure 5a). To score the RMSE, we normalize the centralized RMSE,

$$\mathrm{crmse}(\mathbf{x}) = \sqrt{\frac{1}{T(\mathbf{x})} \int_{t_0}^{t_f} \left( \left( v_{\mathrm{mod}}(t,\mathbf{x}) - \overline{v}_{\mathrm{mod}}(\mathbf{x}) \right) - \left( v_{\mathrm{ref}}(t,\mathbf{x}) - \overline{v}_{\mathrm{ref}}(\mathbf{x}) \right) \right)^2 \mathrm{d}t} \qquad (17)$$

by the centralized RMS of the reference, equation 12, to obtain a relative error,

$$\varepsilon_{\mathrm{rmse}}(\mathbf{x}) = \frac{\mathrm{crmse}(\mathbf{x})}{\mathrm{crms}_{\mathrm{ref}}(\mathbf{x})} \qquad (18)$$

which equation 9 maps to a score at each point,

$$s_{\mathrm{rmse}}(\mathbf{x}) = e^{-\varepsilon_{\mathrm{rmse}}(\mathbf{x})} \qquad (19)$$

The scalar RMSE score is the mass‐weighted spatial mean of this map,

$$S_{\mathrm{rmse}} = \overline{s_{\mathrm{rmse}}} \qquad (20)$$

2.2.3 Phase Shift
To measure the shift in the timing of the seasonal cycle, we first compute the mean annual cycle of the reference and the model,

$$c(m,\mathbf{x}) = \frac{1}{n_{\mathrm{years}}} \sum_{y} v(t_{y,m},\mathbf{x}) \qquad (21)$$

where $t_{y,m}$ denotes month m of year y. We then find the timing of the maximum of each mean annual cycle,

$$\theta(\mathbf{x}) = \operatorname*{arg\,max}_{m}\; c(m,\mathbf{x}) \qquad (22)$$

and define the phase shift as the difference in the timing of the maxima, $\theta_{\mathrm{shift}}(\mathbf{x}) = \theta_{\mathrm{mod}}(\mathbf{x}) - \theta_{\mathrm{ref}}(\mathbf{x})$, which we score with a cyclic mapping,

$$s_{\mathrm{phase}}(\mathbf{x}) = \frac{1}{2} \left( 1 + \cos\!\left( \frac{2\pi\, \theta_{\mathrm{shift}}(\mathbf{x})}{365} \right) \right) \qquad (23)$$

so that a model whose maximum falls at the same time as the reference scores 1, and one that is half a year out of phase scores 0. The scalar phase score, $S_{\mathrm{phase}}$, is the spatial mean of this map, equation 4. We include plots of the timing of the maxima, $\theta_{\mathrm{ref}}(\mathbf{x})$ and $\theta_{\mathrm{mod}}(\mathbf{x})$, and of the mean annual cycles, $c_{\mathrm{ref}}$ and $c_{\mathrm{mod}}$, shown in Figure 6.

and
; (b) mean annual cycle,
and
.
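A sketch of the phase scoring (equations 21–23) for monthly data, assuming whole years and NumPy arrays; names are ours:

```python
import numpy as np

def phase_score(v_ref, v_mod, n_years):
    """Cyclic phase score (equations 21-23) for monthly (n_time, n_space) data."""
    def cycle_max_day(v):
        cycle = v.reshape(n_years, 12, -1).mean(axis=0)   # equation 21
        return cycle.argmax(axis=0) * 365.0 / 12.0        # equation 22, in days
    shift = cycle_max_day(v_mod) - cycle_max_day(v_ref)   # phase shift in days
    return 0.5 * (1.0 + np.cos(2.0 * np.pi * shift / 365.0))  # equation 23
```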
2.2.4 Interannual Variability
To estimate the interannual variability captured in each data set, we remove the mean annual cycle from the time series and compute the RMS of the residual,

$$\mathrm{iav}_{\mathrm{ref}}(\mathbf{x}) = \sqrt{\frac{1}{T(\mathbf{x})} \int_{t_0}^{t_f} \left( v_{\mathrm{ref}}(t,\mathbf{x}) - c_{\mathrm{ref}}(t,\mathbf{x}) \right)^2 \mathrm{d}t} \qquad (24)$$

$$\mathrm{iav}_{\mathrm{mod}}(\mathbf{x}) = \sqrt{\frac{1}{T(\mathbf{x})} \int_{t_0}^{t_f} \left( v_{\mathrm{mod}}(t,\mathbf{x}) - c_{\mathrm{mod}}(t,\mathbf{x}) \right)^2 \mathrm{d}t} \qquad (25)$$

where the mean annual cycle c is understood to repeat over the full time period. We then form a relative error,

$$\varepsilon_{\mathrm{iav}}(\mathbf{x}) = \frac{\mathrm{iav}_{\mathrm{mod}}(\mathbf{x}) - \mathrm{iav}_{\mathrm{ref}}(\mathbf{x})}{\mathrm{iav}_{\mathrm{ref}}(\mathbf{x})} \qquad (26)$$

map its magnitude to a score at each point using equation 9,

$$s_{\mathrm{iav}}(\mathbf{x}) = e^{-\left| \varepsilon_{\mathrm{iav}}(\mathbf{x}) \right|} \qquad (27)$$

and take the spatial mean to obtain the scalar score,

$$S_{\mathrm{iav}} = \overline{s_{\mathrm{iav}}} \qquad (28)$$
2.2.5 Spatial Distribution
To assess the spatial distribution of the period mean, we compute the ratio of the spatial standard deviations,

$$\hat{\sigma} = \frac{\sigma_{\mathrm{mod}}}{\sigma_{\mathrm{ref}}} \qquad (29)$$

where $\sigma_{\mathrm{ref}}$ and $\sigma_{\mathrm{mod}}$ are the spatial standard deviations of the period means, $\overline{v}_{\mathrm{ref}}(\mathbf{x})$ and $\overline{v}_{\mathrm{mod}}(\mathbf{x})$, along with their spatial correlation, R, and then assigning a score by the following relationship

$$S_{\mathrm{dist}} = \frac{2\,(1 + R)}{\left( \hat{\sigma} + \dfrac{1}{\hat{\sigma}} \right)^2} \qquad (30)$$

A model whose period mean has the same spatial variability as the reference ($\hat{\sigma} = 1$) and is perfectly correlated with it (R = 1) receives a score of 1.
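A sketch of the spatial distribution score (equations 29–30) computed from area‐weighted spatial statistics of the two period mean maps; names are ours:

```python
import numpy as np

def spatial_distribution_score(ref_mean, mod_mean, area):
    """Taylor-like spatial distribution score (equations 29-30).

    ref_mean, mod_mean : (n_space,) period mean maps on a shared grid
    area               : (n_space,) cell areas used as spatial weights
    """
    w = area / area.sum()
    mu_r, mu_m = (w * ref_mean).sum(), (w * mod_mean).sum()
    sd_r = np.sqrt((w * (ref_mean - mu_r) ** 2).sum())
    sd_m = np.sqrt((w * (mod_mean - mu_m) ** 2).sum())
    r = (w * (ref_mean - mu_r) * (mod_mean - mu_m)).sum() / (sd_r * sd_m)
    sigma = sd_m / sd_r                                      # equation 29
    return 2.0 * (1.0 + r) / (sigma + 1.0 / sigma) ** 2      # equation 30
```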
2.2.6 Overall Score
The overall score for a given data set combines the scalar scores defined above, with double weight given to the RMSE score:

$$S_{\mathrm{overall}} = \frac{S_{\mathrm{bias}} + 2\, S_{\mathrm{rmse}} + S_{\mathrm{phase}} + S_{\mathrm{iav}} + S_{\mathrm{dist}}}{1 + 2 + 1 + 1 + 1} \qquad (31)$$

2.3 Relationship Analysis
As models are frequently calibrated using the mean state scalar measures described in section 2.2, a higher score does not necessarily reflect a more process‐oriented model. In order to assess the representation of mechanistic processes in models, we also evaluate variable‐to‐variable relationships. For example, we look at how well models represent the relationship that GPP has with precipitation, evapotranspiration, and temperature. For the purposes of this section, we represent a generic dependent variable as v, as before, and score its relationship with an independent variable u. We then compare the variable‐to‐variable relationship of the time period means, $\overline{v}(\mathbf{x})$ on $\overline{u}(\mathbf{x})$, derived from the combination of reference data sets to the relationship diagnosed in models. We use the mean values over the reference time period to establish relationships as they represent a logical starting point. In the future, we plan to extend the relationship analysis to include seasonal and interannual variability.
2.3.1 Functional Response
To approximate the functional response, we first discretize the range of the independent variable, $\overline{u}(\mathbf{x})$, with a number of bins, initially set to $n_{\mathrm{bins}} = 25$. Then, in each bin, we compute the mean value of the corresponding dependent variable, $\overline{v}(\mathbf{x})$, to approximate the functional dependence of v on u. We represent this binning with the operator $\mathcal{B}$, which operates on the dependent and independent variables. We use it to compute response curves from both the reference and model data sets,

$$f_{\mathrm{ref}}(u) = \mathcal{B}\!\left( \overline{v}_{\mathrm{ref}} \,\middle|\, \overline{u}_{\mathrm{ref}} \right) \qquad (32)$$

$$f_{\mathrm{mod}}(u) = \mathcal{B}\!\left( \overline{v}_{\mathrm{mod}} \,\middle|\, \overline{u}_{\mathrm{mod}} \right) \qquad (33)$$

We then compute the relative difference between the two response curves in the discrete 2‐norm,

$$\varepsilon^{u} = \frac{\left\| f_{\mathrm{mod}} - f_{\mathrm{ref}} \right\|_2}{\left\| f_{\mathrm{ref}} \right\|_2} \qquad (34)$$

Then we use equation 9 to map this relative error to a score by

$$s^{u} = e^{-\varepsilon^{u}} \qquad (35)$$
The superscript u reinforces that this score represents functional performance with respect to a given independent variable u. The ILAMB implementation allows for any number of independent variables to be studied. In terms of our sample, ILAMB scores the functional relationship of GPP with respect to each independent variable separately (precipitation, evapotranspiration, temperature, etc.) and then computes the mean of these scores for the overall relationship score.
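A sketch of the binned functional response and its score (equations 32–35), assuming one‐dimensional arrays of collocated period means; the empty‐bin handling here is a simplification of ours:

```python
import numpy as np

def functional_response_score(u_ref, v_ref, u_mod, v_mod, nbins=25):
    """Score of a binned mean response curve (equations 32-35)."""
    edges = np.linspace(u_ref.min(), u_ref.max(), nbins + 1)
    def binned_mean(u, v):
        idx = np.clip(np.digitize(u, edges) - 1, 0, nbins - 1)
        sums = np.bincount(idx, weights=v, minlength=nbins)
        counts = np.bincount(idx, minlength=nbins)
        return sums / np.maximum(counts, 1)          # equations 32-33; empty bins -> 0
    f_ref = binned_mean(u_ref, v_ref)
    f_mod = binned_mean(u_mod, v_mod)
    eps = np.linalg.norm(f_mod - f_ref) / np.linalg.norm(f_ref)  # equation 34
    return np.exp(-eps)                              # equation 35
```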
2.3.2 Hellinger Distance
In addition to the functional response, we compare the two‐dimensional distributions of the dependent and independent variables, $\overline{v}(\mathbf{x})$ and $\overline{u}(\mathbf{x})$, constructed with a normalized two‐dimensional histogram operator, represented here by $\mathcal{D}$. We represent these distributions by

$$d_{\mathrm{ref}}(u,v) = \mathcal{D}\!\left( \overline{v}_{\mathrm{ref}}, \overline{u}_{\mathrm{ref}} \right) \qquad (36)$$

$$d_{\mathrm{mod}}(u,v) = \mathcal{D}\!\left( \overline{v}_{\mathrm{mod}}, \overline{u}_{\mathrm{mod}} \right) \qquad (37)$$

Given the distributions $d_{\mathrm{ref}}$ and $d_{\mathrm{mod}}$, we can compute the so‐called Hellinger distance (Law et al., 2015),

$$H = \sqrt{\frac{1}{2} \sum_{i,j} \left( \sqrt{d_{\mathrm{ref}}(u_i,v_j)} - \sqrt{d_{\mathrm{mod}}(u_i,v_j)} \right)^2} \qquad (38)$$

However, we only report the Hellinger distance as a scalar and do not include it in the scoring of the relationships. This is because a bias in an independent variable can cause a density shift in the 2‐D distribution that would cause the score to unreasonably decrease. In terms of our example, a bias in precipitation (e.g., arising from a coupled model) could result in a poor relationship score with GPP, even if there is no underlying deficiency in the land‐model‐simulated precipitation versus GPP relationship.
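A sketch of the Hellinger distance (equation 38) computed from normalized two‐dimensional histograms of the two relationships; names are ours:

```python
import numpy as np

def hellinger_distance(u_ref, v_ref, u_mod, v_mod, nbins=25):
    """Hellinger distance between two normalized 2-D distributions (equation 38)."""
    rng = [[min(u_ref.min(), u_mod.min()), max(u_ref.max(), u_mod.max())],
           [min(v_ref.min(), v_mod.min()), max(v_ref.max(), v_mod.max())]]
    d_ref, _, _ = np.histogram2d(u_ref, v_ref, bins=nbins, range=rng)
    d_mod, _, _ = np.histogram2d(u_mod, v_mod, bins=nbins, range=rng)
    d_ref /= d_ref.sum()                # normalize to discrete probability masses
    d_mod /= d_mod.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(d_ref) - np.sqrt(d_mod)) ** 2))
```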
3 Data Sets
In this section we explain how we utilize the methodology presented in section 2 to evaluate model performance with respect to a collection of data sets (Tables 2– 5) assembled by the ILAMB community. Errors in measurements, lack of measured or reported uncertainties, and inconsistencies in measurement methodology or instrumentation leading to ambiguous confidence in derived or synthesized data products all represent challenges in using observational data for benchmarking. In addition, the spatial and temporal coverage of different data products can vary substantially.
Table 2. Ecosystem and Carbon Cycle Variables and Data Sets

| Variable/Data set | Certainty | Scale | Process |
|---|---|---|---|
| Biomass | | | 5 |
| Tropical (Saatchi et al., 2011) | 4 | 4 | |
| NBCD2000 (Kellndorfer et al., 2013) | 4 | 2 | |
| USForest (Blackard et al., 2008) | 4 | 2 | |
| Burned area | | | 4 |
| GFED4S (Giglio et al., 2010) | 4 | 5 | |
| Gross primary productivity | | | 5 |
| Fluxnet (Lasslop et al., 2010) | 3 | 3 | |
| GBAF (Jung et al., 2010) | 3 | 5 | |
| Leaf area index | | | 3 |
| AVHRR (Myneni et al., 1997) | 3 | 5 | |
| MODIS (De Kauwe et al., 2011) | 3 | 5 | |
| Global net ecosystem carbon balance | | | 5 |
| GCP (Le Quéré et al., 2016) | 4 | 5 | |
| Hoffman (Hoffman et al., 2014) | 4 | 5 | |
| Net ecosystem exchange | | | 5 |
| Fluxnet (Lasslop et al., 2010) | 3 | 3 | |
| GBAF (Jung et al., 2010) | 2 | 2 | |
| Ecosystem respiration | | | 4 |
| Fluxnet (Lasslop et al., 2010) | 2 | 3 | |
| GBAF (Jung et al., 2010) | 2 | 2 | |
| Soil carbon | | | 5 |
| HWSD (Todd‐Brown et al., 2013) | 3 | 5 | |
| NCSCDV22 (Hugelius et al., 2013) | 3 | 4 | |
- Note. Weights are chosen using the rubric in Table 1 and reflect a focus on understanding the carbon cycle.
Table 3. Hydrological Cycle Variables and Data Sets

| Variable/Data set | Certainty | Scale | Process |
|---|---|---|---|
| Evapotranspiration | | | 5 |
| GLEAM (Miralles et al., 2011) | 3 | 5 | |
| MODIS (De Kauwe et al., 2011) | 3 | 5 | |
| Evaporative fraction | | | 5 |
| GBAF (Jung et al., 2010) | 3 | 3 | |
| Latent heat | | | 5 |
| Fluxnet (Lasslop et al., 2010) | 3 | 1 | |
| GBAF (Jung et al., 2010) | 3 | 3 | |
| Runoff | | | 5 |
| Dai (Dai & Trenberth, 2002) | 3 | 5 | |
| Sensible heat | | | 2 |
| Fluxnet (Lasslop et al., 2010) | 3 | 3 | |
| GBAF (Jung et al., 2010) | 3 | 5 | |
| Terrestrial water storage anomaly | | | 5 |
| GRACE (Swenson & Wahr, 2006) | 5 | 5 | |
- Note. Weights are chosen using the rubric in Table 1 and reflect a focus on understanding the carbon cycle.
Table 4. Radiation and Energy Cycle Variables and Data Sets

| Variable/Data set | Certainty | Scale | Process |
|---|---|---|---|
| Albedo | | | 1 |
| CERES (Kato et al., 2013) | 4 | 5 | |
| GEWEX.SRB (Stackhouse et al., 2011) | 4 | 5 | |
| MODIS (De Kauwe et al., 2011) | 4 | 5 | |
| Surface upward SW radiation | | | 1 |
| CERES (Kato et al., 2013) | 4 | 4 | |
| GEWEX.SRB (Stackhouse et al., 2011) | 4 | 5 | |
| WRMC.BSRN (König‐Langlo et al., 2013) | 4 | 3 | |
| Surface net SW radiation | | | 1 |
| CERES (Kato et al., 2013) | 4 | 5 | |
| GEWEX.SRB (Stackhouse et al., 2011) | 4 | 5 | |
| WRMC.BSRN (König‐Langlo et al., 2013) | 4 | 3 | |
| Surface upward LW radiation | | | 1 |
| CERES (Kato et al., 2013) | 4 | 5 | |
| GEWEX.SRB (Stackhouse et al., 2011) | 4 | 5 | |
| WRMC.BSRN (König‐Langlo et al., 2013) | 4 | 3 | |
| Surface net LW radiation | | | 1 |
| CERES (Kato et al., 2013) | 4 | 5 | |
| GEWEX.SRB (Stackhouse et al., 2011) | 4 | 5 | |
| WRMC.BSRN (König‐Langlo et al., 2013) | 4 | 3 | |
| Surface net radiation | | | 2 |
| CERES (Kato et al., 2013) | 4 | 5 | |
| Fluxnet (Lasslop et al., 2010) | 4 | 3 | |
| GEWEX.SRB (Stackhouse et al., 2011) | 4 | 5 | |
| WRMC.BSRN (König‐Langlo et al., 2013) | 4 | 3 | |
- Note. Weights are chosen using the rubric in Table 1 and reflect a focus on understanding the carbon cycle.
Table 5. Climate Forcing Variables and Data Sets

| Variable/Data set | Certainty | Scale | Process |
|---|---|---|---|
| Surface air temperature | | | 2 |
| CRU (Harris et al., 2014) | 5 | 5 | |
| Fluxnet (Lasslop et al., 2010) | 3 | 3 | |
| Precipitation | | | 2 |
| CMAP (Xie & Arkin, 1997) | 4 | 5 | |
| Fluxnet (Lasslop et al., 2010) | 3 | 3 | |
| GPCC (Schneider et al., 2014) | 4 | 5 | |
| GPCP2 (Adler et al., 2012) | 4 | 5 | |
| Surface relative humidity | | | 3 |
| ERA (Dee et al., 2011) | 2 | 5 | |
| Surface downward SW radiation | | | 2 |
| CERES (Kato et al., 2013) | 4 | 5 | |
| Fluxnet (Lasslop et al., 2010) | 4 | 3 | |
| GEWEX.SRB (Stackhouse et al., 2011) | 4 | 5 | |
| WRMC.BSRN (König‐Langlo et al., 2013) | 4 | 3 | |
| Surface downward LW radiation | | | 1 |
| CERES (Kato et al., 2013) | 4 | 5 | |
| GEWEX.SRB (Stackhouse et al., 2011) | 4 | 5 | |
| WRMC.BSRN (König‐Langlo et al., 2013) | 4 | 3 | |
- Note. Weights are chosen using the rubric in Table 1 and reflect a focus on understanding the carbon cycle.

Following the rubric in Table 1, each data set receives a weight for the certainty of its measurements and a weight for its spatial and temporal scale. A third weight reflects how useful the measured variable is to the focus of a model intercomparison project (MIP). Here, as an example, we show weighting for an analysis of model performance in representing the carbon cycle. We use these weights to blend the overall scores from each variable into a complete score across all variables for a given model. This allows ILAMB to include comparisons that are important for a complete understanding of the carbon cycle without necessarily allowing them to heavily influence the overall score. For example, the radiation and energy cycle data sets in Table 4 are all weighted comparatively low because, while they help one understand the carbon cycle, they are not as influential in its overall behavior.
We emphasize that this rubric is particular to our overarching goal of understanding the carbon cycle on global and decadal scales. However, the implementation is flexible and allows for an arbitrary weighting scheme to be developed that suits the needs of the user, community, or MIP that it serves.
The references and weights for each data set we have selected may be found in Tables 2–5. Each table represents a different aspect of the model: the ecosystem and carbon cycle in Table 2, the hydrological cycle in Table 3, the radiation and energy cycle in Table 4, and the forcings in Table 5. For the majority of these data sets, we make a direct comparison of the observed quantity to model outputs, or to algebraic combinations of model outputs, using the methodology described in section 2. However, there are a few special cases that require specific handling, which we describe in the next section.
3.1 Special Cases
In general, a consistent methodology is applied to compare model output with each data set. This consistency across variables and data sets is a strength of the ILAMB methodology. However, this is not always possible, and here we enumerate a few exceptions and how they are handled.
3.1.1 Evaporative Fraction
The evaporative fraction is not a direct model output but is derived from the latent (Le) and sensible (Sh) heat fluxes,

$$ef(t,\mathbf{x}) = \frac{Le(t,\mathbf{x})}{Le(t,\mathbf{x}) + Sh(t,\mathbf{x})} \qquad (39)$$

The expression can produce nonsensical results because in winter the sensible heat flux can be negative, leading to a change of sign in the evaporative fraction. The expression can also lead to large evaporative fraction values when the magnitudes of both the latent and sensible heat become small. For this reason, we apply a mask to ef, Le, and Sh, considering only values for which $Sh > 0$, $Le > 0$, and $Sh + Le > \phi$, where $\phi = 20$ W/m² is a surface energy threshold:

$$ef(t,\mathbf{x}) = \frac{Le(t,\mathbf{x})}{Le(t,\mathbf{x}) + Sh(t,\mathbf{x})}, \quad \text{where } Sh > 0,\ Le > 0,\ Sh + Le > \phi \qquad (40)$$

Beyond this change, the evaporative fraction is evaluated using the methodology defined in section 2.
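A sketch of the masked computation (equations 39–40); invalid points are returned as NaN here so that downstream averages can treat them as missing:

```python
import numpy as np

def evaporative_fraction(le, sh, phi=20.0):
    """Masked evaporative fraction (equations 39-40).

    le, sh : latent and sensible heat fluxes (W/m^2), arrays of matching shape
    """
    ok = (le > 0.0) & (sh > 0.0) & (le + sh > phi)
    # Guard the denominator where masked to avoid divide-by-zero warnings
    return np.where(ok, le / np.where(ok, le + sh, 1.0), np.nan)
```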
3.1.2 Albedo
The albedo is similarly derived, as the ratio of the upward to the downward shortwave radiation at the surface,

$$\alpha(t,\mathbf{x}) = \frac{rsus(t,\mathbf{x})}{rsds(t,\mathbf{x})} \qquad (41)$$

As with the evaporative fraction, the ratio becomes ill conditioned where the denominator is small, so we consider only times and locations where the downward flux is positive,

$$\alpha(t,\mathbf{x}) = \frac{rsus(t,\mathbf{x})}{rsds(t,\mathbf{x})}, \quad \text{where } rsds > 0 \qquad (42)$$

3.1.3 Global Net Ecosystem Carbon Balance
The global net ecosystem carbon balance is compared not as a spatial map but as a globally integrated quantity accumulated in time,

$$\mathrm{mass}(t) = \int_{t_0}^{t} \int_{\Omega} v(\tau,\mathbf{x})\,\mathrm{d}\Omega\,\mathrm{d}\tau \qquad (43)$$

The reference data sets provide an estimate of the uncertainty in this accumulation, $\delta$ (Pg C). We use this uncertainty to normalize the difference in accumulation at the end of the time period as a measure of relative error,

$$\varepsilon_{\mathrm{diff}} = \frac{\left| \mathrm{mass}_{\mathrm{mod}}(t_f) - \mathrm{mass}_{\mathrm{ref}}(t_f) \right|}{\delta} \qquad (44)$$

which equation 9 maps to a score,

$$s_{\mathrm{diff}} = e^{-\varepsilon_{\mathrm{diff}}} \qquad (45)$$

We also score how well the model tracks the evolution of the accumulation through time by analogy with the spatial distribution score, equation 30, where the correlation and standard deviation are taken across the temporal dimension,

$$\hat{\sigma} = \frac{\sigma_{\mathrm{mod}}}{\sigma_{\mathrm{ref}}} \qquad (46)$$

$$s_{\mathrm{dist}} = \frac{2\,(1 + R)}{\left( \hat{\sigma} + \dfrac{1}{\hat{\sigma}} \right)^2} \qquad (47)$$

Then the overall score is the mean of the two,

$$S = \frac{s_{\mathrm{diff}} + s_{\mathrm{dist}}}{2} \qquad (48)$$

3.1.4 Runoff
We use the Dai and Trenberth (2002) river discharge data set to assess model runoff performance over the world's 50 largest river basins. First, we compute the mean annual runoff from the model over the time period of the observational data set. Then we distribute the river discharge data over the area of each river basin and compare it to the mean model runoff over the same basin. This simple approach allows us to compare runoff across models even if they do not include a river routing model.
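A sketch of the basin comparison described above, under the assumption that basin masks, cell areas, and one observed mean discharge per basin are available; all names are ours, not the package's:

```python
import numpy as np

def basin_runoff_bias(mean_runoff, area, basin_masks, discharge):
    """Compare basin-mean model runoff to discharge spread over each basin.

    mean_runoff : (n_space,) model mean annual runoff (kg m-2 s-1)
    area        : (n_space,) cell areas (m2)
    basin_masks : dict of basin name -> (n_space,) boolean mask
    discharge   : dict of basin name -> observed mean discharge (kg s-1)
    """
    bias = {}
    for name, mask in basin_masks.items():
        a = area[mask]
        model = (mean_runoff[mask] * a).sum() / a.sum()  # basin-mean model runoff
        ref = discharge[name] / a.sum()                  # discharge per unit basin area
        bias[name] = model - ref
    return bias
```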
We include plots of the mean runoff of the reference and model over river basins and their bias, represented in Figure 10. We also include regional mean runoff plots for each of the river basins included, but only show that of the Amazon river basin in Figure 10d. The model performance is then scored using the bias (section 2.2.1), the interannual variability (section 2.2.4), and the spatial distribution (section 2.2.5) metrics.

3.1.5 Terrestrial Water Storage Anomaly
We use the Gravity Recovery and Climate Experiment (Swenson & Wahr, 2006) data set to assess the terrestrial water storage anomaly in models. However, there are a few challenges in producing a fair comparison. The first is that models report only the storage, so the anomaly must be computed. The more serious challenge is that the resolution of these data is quite coarse (300–400 km), and thus pointwise comparisons are not appropriate (Swenson, 2013). Instead, we compare mean anomaly values over 30 of the world's largest river basins. In this way the comparison is fairer, as it is made over large areas and automatically omits dry areas, which are not of interest.
We include plots of the magnitude of the mean anomaly of the reference and model over river basins and the RMSE, represented in Figure 11. We also include regional mean anomaly plots for each of the river basins but show only that of the Amazon river basin in Figure 11d. The model performance is then scored using the RMSE (section 2.2.2) and the interannual variability (section 2.2.4) metrics.

4 Software
We have implemented the methodology described in sections 2 and 3 into a software package that is freely available to the community. We previously developed a prototype implementation (Mu et al., 2015) based on the National Center for Atmospheric Research Command Language. We then moved the algorithm into an open source, openly developed python package (Collier et al., 2016) in an effort to produce a product to which the community can more easily make contributions. The referenced digital object identifier will lead to the software repository, where the source code and documentation can be found. The documentation includes the public interface as well as tutorials that span topics such as installation, basic usage, adding models or benchmark data sets, and formatting benchmark data sets.
The ILAMB package is designed to ingest data sets that follow the Climate and Forecast convention (Eaton et al., 2017). The Climate and Forecast website explains that the “conventions define metadata that provide a definitive description of what the data in each variable represents, and the spatial and temporal properties of the data. This enables users of data from different sources to decide which quantities are comparable, and facilitates building applications with powerful extraction, regridding, and display capabilities.” We have built the ILAMB package to embody this philosophy, making it directly useful to those who adhere to this standard. While model intercomparison efforts, such as CMIP5, have encouraged the use of these conventions among modelers, the observational community has not yet widely put them into practice. Much of the work in adding data sets to the collection is in encoding them to follow this convention.
For the purpose of communicating how the ILAMB package works, consider the configure file shown in Figure 12, which defines a set of observational data sets that will be used to confront models. The h1 bracket is a heading used to categorize variables, each of which is represented by an h2 heading. This comparison involves the surface upward shortwave radiation and the albedo, both of which are variables belonging to the radiation and energy cycle. Inside each h2 heading, we specify the variable name that will be compared (rsus is the netCDF variable name for surface upward shortwave radiation). We also provide a mechanism for variable synonyms by specifying alternate variable names: if the ILAMB system cannot find the main variable, it will try any alternates that the user specifies. This allows the software to encourage the use of standard variable names while accommodating modeling groups that want to use ILAMB without preprocessing their output. Also note the derived keyword in the albedo section. While the components of albedo are part of standard model output, the albedo itself is not. The ILAMB package therefore allows users to specify algebraic relationships in the configure file, making the derivation automatic and transparent to anyone who reads the file.

The ILAMB package will ingest this configure file and try to build commensurate quantities from model outputs. While observational data sets come in different forms (globally gridded remote sensing products, tower data collections, etc.), the ILAMB system reads the spatial and temporal information found in the file and uses it to trim, subsample, and/or coarsen the model data as appropriate.
5 Discussion
The ILAMB framework is designed to be both powerful and flexible. While the default configuration described above focuses on global analysis for decadal to centennial scale ESMs, ILAMB allows the user to customize the selection of variables, the weighting of data sets, and spatial subsetting, making it useful for assessing results from mesoscale weather forecasting or other models. We envision developing a library of sample configuration files targeting various well‐known models and model applications.
As much of the usefulness of ILAMB depends on the quality of the underlying observational data, we recommend that data providers include explicit representations of the underlying spatial grids including the areas over which quantities have been averaged. Observational data sets frequently report mean values in a cell taken over an area which may include land but also portions of lakes, rivers, and oceans. This leads to ambiguity with regard to the contribution of land cover types to the measurement itself and subsequently adds to the uncertainty when comparing values to model output.
5.1 Interpreting the Overall Score
The thrust of this paper is to detail a methodology for computing a single overall score that captures a model's skill in reproducing patterns found in the observed record. However, we do not view the absolute value of the score as particularly meaningful beyond the precise definition described in this paper. In general, no model can achieve a perfect score for any given variable for several reasons.
First, there is measurement error and uncertainty in the observational data that makes a perfect score against even a single data set unlikely or undesirable. This is what motivates some in the community (Abramowitz, 2005; Best et al., 2015) to argue that benchmarking requires an expectation of performance, which is admittedly lacking in our approach. Second, although every attempt is made to employ multiple independent data sets of high quality for confrontation with models, these data sets are inconsistent with each other, making a perfect score across all data sets impossible. We include comparisons with multiple observational and synthesized data sets for a single variable to offer the user more information about the robustness of model predictions within the limits of observational uncertainty at varying spatial and temporal scales. Third, a lower score with respect to a given variable is not necessarily a sign of a poor model. It may rather highlight the need for better measurement campaigns or improved metrics (i.e., sometimes we learn that our measurements are incomplete or do not acknowledge important uncertainties, or that our metrics are inappropriate for a given data set).
The overall score is meant to aid the scientist in discovering when meaningful changes have occurred in the model or across models. The holistic nature of the ILAMB suite of data sets and metrics helps provide a synthesis of model performance that directs the attention of the user to relevant aspects. While we present Figure 1 as the main result of the ILAMB methodology, it is intended to merely indicate variables of particular interest for further consideration. ILAMB output is presented as a hierarchy of interactive web pages that employ JavaScript features to present information to users in a logical and intuitive fashion. From the graphical overview, the user can select individual variables and data sets from the Results Table tab to be led to pages that detail the contributing factors to the model's overall score. On this new page, predefined spatial regions can be individually selected, causing the tabular data and diagnostics to be updated automatically to reflect information relevant only to that region. Although all the tabular information, scores, and graphical diagnostics are precomputed and generated when ILAMB is run, the web‐based interface is designed to facilitate discovery and understanding of model results. The overall score does not replace the scientist; it guides him or her to the relevant plots and diagnostics.
5.2 How Is ILAMB Used?
The ILAMB package is particularly useful for verification, that is, during model development to confirm that new model code improves performance in a targeted area without degrading performance in another area, and for validation, that is, when comparing performance of one model or model version to that of other models or model versions.
In developing and applying the ILAMB package, we have incorporated a wide variety of representative observational data sets (see Tables 2-5), and we have favored data that have the most open data policies. In many cases, these data have been averaged or remapped to be more directly comparable with model output. As this collection of data sets grows, maintaining and distributing the latest versions will be challenging and require community collaboration. For tracking the evolving performance of models over the long term, it may be necessary to maintain access to older versions of data as well as the latest version since corrections to observational data sets can significantly impact model performance scores. Various technologies could fill this role, and the Observations for Climate Model Intercomparisons (obs4MIPs; https://www.earthsystemcog.org/projects/obs4mips/) activity shows promise as a potential solution to this challenge. The preferred solution would ideally support versioning and allow for long‐lived versions associated with ILAMB releases. In the interim, we have implemented a simple scheme for sharing summarized and remapped data sets through a web server.
The ILAMB package is currently being used by individual model developers and international modeling centers. ILAMB offers developers a quick and easy method for checking the impacts of new model development before committing code changes. For modeling centers, ILAMB provides a systematic assessment of historical simulation experiments and enables tracking of performance of model revisions. ILAMB will also be useful for MIPs as a starting point for evaluating model variability and uncertainty. As a part of such MIPs, investigators may wish to develop custom metrics or incorporate data sets specific to their purposes. ILAMB could be executed automatically as model results are uploaded to a system like the Earth System Grid Federation (https://esgf.llnl.gov/) to give users a first look at variation in results and to determine if output should be downloaded for a particular study. ILAMB diagnostics can also be useful for parameter sensitivity studies or for optimization experiments in combination with an automated modeling framework like the Predictive Ecosystem Analyzer (http://pecanproject.org/; Dietze et al., 2014; LeBauer et al., 2013). For the assessments community, the results of a multimodel ILAMB evaluation could be useful for understanding which model results would be appropriate for use in studying impacts and which models may poorly capture processes relevant to the impacts under consideration.
5.3 Future Work
Development of the ILAMB package is ongoing, and the terrestrial modeling and observational communities are being engaged to identify in situ and remote sensing data sets, to define additional evaluation metrics, and to use the package for a wide variety of MIPs (Hoffman et al., 2017). While most effort has been invested in global‐ and regional‐scale model evaluation, new work is focused on improved benchmarking for site level time series, spatial transects, and seasonal and diurnal variability. Future development will include incorporation of experiment‐specific model evaluation metrics derived from prior studies, including Free‐Air CO2 Enrichment (Walker et al., 2014, 2015; Zaehle et al., 2014), nutrient addition, rainfall exclusion, and warming experiments (Bouskill et al., 2014; Zhu et al., 2016). Partner activities, like NASA's Permafrost Benchmarking System project and the Arctic‐Boreal Vulnerability Experiment, are integrating additional data sets and building metrics for specific regions, study areas, or processes of interest. We are applying the ILAMB methodology and code base to develop a marine biogeochemical model benchmarking tool, called the International Ocean Model Benchmarking package.
6 Summary
Based on previous prototypes and community discussion, we developed the ILAMB model benchmarking package for evaluating the fidelity of land carbon cycle models. The package generates graphical diagnostics and computes a comprehensive set of statistics through model‐data comparisons and scores model performance for a wide variety of variables for a suite of observational data sets. Rigorously defined model evaluation metrics and strategies for handling multiple resolutions and land masks are documented above. The ILAMB package is open source and is becoming widely adopted by modeling centers and for informing model intercomparison studies. We are actively seeking community involvement in adding more evaluation metrics and new observational data sets.
Acknowledgments
This manuscript has been authored by UT‐Battelle, LLC under contract DE‐AC05‐00OR22725 with the U.S. Department of Energy. The U.S. Government retains and the publisher, by accepting the article for publication, acknowledges that the U.S. Government retains a nonexclusive, paid‐up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for U.S. Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). This research was supported through the Reducing Uncertainties in Biogeochemical Interactions through Synthesis and Computation Scientific Focus Area (RUBISCO SFA), which is sponsored by the Regional and Global Climate Modeling (RGCM) Program in the Climate and Environmental Sciences Division (CESD) of the Office of Biological and Environmental Research (BER) in the U.S. Department of Energy Office of Science. Oak Ridge National Laboratory (ORNL) is managed by UT‐Battelle, LLC for the U.S. Department of Energy under contract DE‐AC05‐00OR22725. The National Center for Atmospheric Research (NCAR) is managed by the University Corporation for Atmospheric Research (UCAR) on behalf of the National Science Foundation (NSF). Lawrence Berkeley National Laboratory (LBNL) is managed and operated by the Regents of the University of California under contract DE‐AC02‐05CH11231.
References
- Ensheng Weng, Ray Dybzinski, Caroline E. Farrior, Stephen W. Pacala, Competition alters predicted forest carbon cycle responses to nitrogen availability and elevated CO<sub>2</sub>: simulations using an explicitly competitive, game-theoretic vegetation demographic model, Biogeosciences, 10.5194/bg-16-4577-2019, 16, 23, (4577-4599), (2019).
- Avni Malhotra, Katherine Todd-Brown, Lucas E Nave, Niels H Batjes, James R Holmquist, Alison M Hoyt, Colleen M Iversen, Robert B Jackson, Kate Lajtha, Corey Lawrence, Olga Vindušková, William Wieder, Mathew Williams, Gustaf Hugelius, Jennifer Harden, The landscape of soil carbon data: emerging questions, synergies and databases, Progress in Physical Geography: Earth and Environment, 10.1177/0309133319873309, (030913331987330), (2019).
- Eric Stofferahn, Joshua B Fisher, Daniel J Hayes, Christopher R Schwalm, Deborah N Huntzinger, Wouter Hantson, Benjamin Poulter, Zhen Zhang, The Arctic-Boreal vulnerability experiment model benchmarking system, Environmental Research Letters, 10.1088/1748-9326/ab10fa, 14, 5, (055002), (2019).
- Christoph Heinze, Veronika Eyring, Pierre Friedlingstein, Colin Jones, Yves Balkanski, William Collins, Thierry Fichefet, Shuang Gao, Alex Hall, Detelina Ivanova, Wolfgang Knorr, Reto Knutti, Alexander Löw, Michael Ponater, Martin G. Schultz, Michael Schulz, Pier Siebesma, Joao Teixeira, George Tselioudis, Martin Vancoppenolle, ESD Reviews: Climate feedbacks in the Earth system and prospects for their evaluation, Earth System Dynamics, 10.5194/esd-10-379-2019, 10, 3, (379-452), (2019).
- Corinne Le Quéré, Robbie M. Andrew, Pierre Friedlingstein, Stephen Sitch, Judith Hauck, Julia Pongratz, Penelope A. Pickers, Jan Ivar Korsbakken, Glen P. Peters, Josep G. Canadell, Almut Arneth, Vivek K. Arora, Leticia Barbero, Ana Bastos, Laurent Bopp, Frédéric Chevallier, Louise P. Chini, Philippe Ciais, Scott C. Doney, Thanos Gkritzalis, Daniel S. Goll, Ian Harris, Vanessa Haverd, Forrest M. Hoffman, Mario Hoppema, Richard A. Houghton, George Hurtt, Tatiana Ilyina, Atul K. Jain, Truls Johannessen, Chris D. Jones, Etsushi Kato, Ralph F. Keeling, Kees Klein Goldewijk, Peter Landschützer, Nathalie Lefèvre, Sebastian Lienert, Zhu Liu, Danica Lombardozzi, Nicolas Metzl, David R. Munro, Julia E. M. S. Nabel, Shin-ichiro Nakaoka, Craig Neill, Are Olsen, Tsueno Ono, Prabir Patra, Anna Peregon, Wouter Peters, Philippe Peylin, Benjamin Pfeil, Denis Pierrot, Benjamin Poulter, Gregor Rehder, Laure Resplandy, Eddy Robertson, Matthias Rocher, Christian Rödenbeck, Ute Schuster, Jörg Schwinger, Roland Séférian, Ingunn Skjelvan, Tobias Steinhoff, Adrienne Sutton, Pieter P. Tans, Hanqin Tian, Bronte Tilbrook, Francesco N. Tubiello, Ingrid T. van der Laan-Luijkx, Guido R. van der Werf, Nicolas Viovy, Anthony P. Walker, Andrew J. Wiltshire, Rebecca Wright, Sönke Zaehle, Bo Zheng, Global Carbon Budget 2018, Earth System Science Data, 10.5194/essd-10-2141-2018, 10, 4, (2141-2194), (2018).
- W. J. Riley, Q. Zhu, J. Y. Tang, Weaker land–climate feedbacks from nutrient uptake during photosynthesis-inactive periods, Nature Climate Change, 10.1038/s41558-018-0325-4, (2018).
- Gautam Bisht, William J. Riley, Glenn E. Hammond, David M. Lorenzetti, Development and evaluation of a variably saturated flow model in the global E3SM Land Model (ELM) version 1.0, Geoscientific Model Development, 10.5194/gmd-11-4085-2018, 11, 10, (4085-4102), (2018).





