Volume 10, Issue 11
Research Article
Open Access

The International Land Model Benchmarking (ILAMB) System: Design, Theory, and Implementation

Nathan Collier


Climate Change Science Institute, Oak Ridge National Laboratory, Oak Ridge, TN, USA

Correspondence to: N. Collier (nathaniel.collier@gmail.com)

Forrest M. Hoffman

Climate Change Science Institute, Oak Ridge National Laboratory, Oak Ridge, TN, USA

Department of Civil and Environmental Engineering, University of Tennessee, Knoxville, Knoxville, TN, USA

David M. Lawrence

Climate and Global Dynamics Division, National Center for Atmospheric Research, Boulder, CO, USA

Gretchen Keppel‐Aleks

Department of Climate and Space Sciences and Engineering, University of Michigan, Ann Arbor, MI, USA

Charles D. Koven

Climate Sciences Department, Lawrence Berkeley National Laboratory, Berkeley, CA, USA

William J. Riley

Climate Sciences Department, Lawrence Berkeley National Laboratory, Berkeley, CA, USA

Mingquan Mu

Department of Earth System Science, University of California, Irvine, CA, USA

James T. Randerson

Department of Earth System Science, University of California, Irvine, CA, USA

First published: 12 October 2018

Abstract

The increasing complexity of Earth system models has inspired efforts to quantitatively assess model fidelity through rigorous comparison with best available measurements and observational data products. Earth system models exhibit a high degree of spread in predictions of land biogeochemistry, biogeophysics, and hydrology, which are sensitive to forcing from other model components. Based on insights from prior land model evaluation studies and community workshops, the authors developed an open source model benchmarking software package that generates graphical diagnostics and scores model performance in support of the International Land Model Benchmarking (ILAMB) project. Employing a suite of in situ, remote sensing, and reanalysis data sets, the ILAMB package performs comprehensive model assessment across a wide range of land variables and generates a hierarchical set of web pages containing statistical analyses and figures designed to provide the user insights into strengths and weaknesses of multiple models or model versions. Described here is the benchmarking philosophy and mathematical methodology embodied in the most recent implementation of the ILAMB package. Comparison methods unique to a few specific data sets are presented, and guidelines for configuring an ILAMB analysis and interpreting resulting model performance scores are discussed. ILAMB is being adopted by modeling teams and centers during model development and for model intercomparison projects, and community engagement is sought for extending evaluation metrics and adding new observational data sets to the benchmarking framework.

1 Introduction

As Earth system models (ESMs) become increasingly complex and observational data volumes rapidly expand, there is a growing need for comprehensive and multifaceted evaluation of model fidelity. Process‐rich ESMs pose challenges to developers implementing new parameterizations or tuning process representations, and to the broader community seeking information about the skill of model predictions. Model developers and software engineers require a systematic means for evaluating changes in model results to ensure that developments improve the scientific performance of target process representations while not adversely affecting results in other, possibly less familiar, parts of the model. To advance understanding and predictability of terrestrial biogeochemical processes and their interactions with hydrology and climate under conditions of increasing atmospheric carbon dioxide, rigorous analysis methods, employing best available observational data, are required to objectively assess and constrain model predictions, inform model development, and identify needed measurements and field experiments (Hoffman et al., 2017).

Building upon past model evaluation work (Randerson et al., 2009), we developed an extensible model benchmarking package in support of the goals of the International Land Model Benchmarking (ILAMB; https://www.ilamb.org/) activity. ILAMB's goals are to
  1. develop internationally accepted benchmarks for land model performance by drawing upon international expertise and collaboration;
  2. promote the use of these benchmarks by the international community for model intercomparison and development;
  3. strengthen linkages among experimental, remote sensing, and climate modeling communities in the design of new model tests, benchmarks, and measurement programs; and
  4. support the design and development of a new, open source, benchmarking software system for use by the international community.

Three ILAMB workshops have been held—in Exeter, UK, in 2009; Irvine, California, USA, in 2011 (Luo et al., 2012); and Washington, DC, USA, in 2016 (Hoffman et al., 2017)—to engage the modeling, measurements, and remote sensing communities in the identification of observational data sets and the design of model evaluation metrics. In this way, community consensus was sought for the curation of observational data and the methodology of model evaluation and scoring, which are described below.

Recognition that the capacities of the terrestrial and marine biosphere to store anthropogenic carbon will weaken under climate warming (Cox et al., 2000; Denman et al., 2007; Friedlingstein et al., 2001; Fung et al., 2005; Mahowald et al., 2017; Moore et al., 2018; Randerson et al., 2015) and that uncertainties in carbon cycle feedbacks must be quantified and reduced to improve projections of future climate change (Arora et al., 2013; Ciais et al., 2013; Friedlingstein et al., 2006, 2014; Gregory et al., 2009; Hoffman et al., 2014) has inspired efforts to quantitatively evaluate model performance through comparison with in situ and remote sensing observations (Anav et al., 2013; Eyring et al., 2016). Multimodel simulation results from the third Coupled Model Intercomparison Project (CMIP3; Meehl et al., 2007) and fifth CMIP (CMIP5; Taylor et al., 2012), which informed the Intergovernmental Panel on Climate Change Fourth and Fifth Assessment Reports (AR4 and AR5), provided opportunities for developing and testing model evaluation diagnostics, formal metrics, and exploration of benchmarking concepts and techniques. Early work on coupled model evaluation and establishing formal metrics focused primarily on atmospheric variables (Gleckler et al., 2008; Reichler & Kim, 2008). Following the first two ILAMB workshops, the land modeling community began exploring standardized and comprehensive benchmarking for terrestrial carbon cycle models (Abramowitz, 2012; Anav et al., 2013; Blyth et al., 2011; Bouskill et al., 2014; Cadule et al., 2010; Dalmonech & Zaehle, 2013; Ghimire et al., 2016; Kelley et al., 2013; Piao et al., 2013). While some researchers define benchmarking as a series of model tests based on a predefined expected level of performance (Abramowitz, 2005; Best et al., 2015), most of the systematic benchmarking strategies explored by the land modeling community to date do not depend upon the establishment of an expected level of performance.

The ILAMB software package, hereafter referred to as ILAMB, shares some of the same goals as existing model diagnostic and evaluation tools, such as the Protocol for the Analysis of Land Surface models (Abramowitz, 2012), the Program for Climate Model Diagnosis and Intercomparison Metrics Package (Gleckler et al., 2016), the ESM Evaluation Tool (Eyring et al., 2016), the Land surface Verification Toolkit (Kumar et al., 2012), and a wide variety of often custom‐developed diagnostic packages in use at international modeling centers. Some of these tools provide model‐to‐model comparisons, a large collection of stand‐alone graphical diagnostics, or workflow infrastructure that allows one to regenerate analysis results from previously published studies but with new model outputs. In contrast, ILAMB was designed to compare multiple models or model versions with observations simultaneously, assess functional relationships between prognostic variables and one or more forcing variables through variable‐to‐variable comparisons (e.g., gross primary production vs. precipitation), and score model performance across a suite of metrics, variables, and data sets. Model performance is evaluated for variables in categories of biogeochemistry (Table 2), hydrology (Table 3), radiation and energy (Table 4), and climate forcing (Table 5).

For every variable, ILAMB generates graphical diagnostics (spatial contour maps, time series line plots, and Taylor diagrams; Taylor, 2001) and scores model performance for the period mean, bias, root‐mean‐square error (RMSE), spatial distribution, interannual coefficient of variation, seasonal cycle, and long‐term trend. Model performance scores are calculated for each metric and variable and are scaled based on the degree of certainty of the observational data set, the scale appropriateness, and the overall importance of the constraint or process to model predictions, following a customizable rubric described below (Table 1). Scores are aggregated across metrics and data sets, producing a single scalar score for each variable for every model or model version. As shown in Figure 1, these scalar scores are presented graphically. On the left side we use a stoplight color scheme to indicate aggregate performance for each model by variable. On the right, we show relative performance (i.e., Z score), indicating which models or model versions perform better with respect to others contained in the overall analysis.

Table 1. The ILAMB Rubric Used to Assign Relative Weights of a Data Set
Score | Certainty | Scale | Process
1 | No given uncertainty, significant methodological issues affecting quality | Site level observations with limited space/time coverage | Observations that have limited influence on the targeted Earth system dynamics
2 | No given uncertainty, some methodological issues affecting quality | Partial regional coverage, up to 1 year | Observations have direct influence on the targeted Earth system dynamics
3 | No given uncertainty, methodology has some peer review | Regional coverage, at least 1 year | Observations useful to constrain processes that contribute to the targeted Earth system dynamics
4 | Qualitative uncertainty, methodology accepted | Important regional coverage, at least 1 year | Observations well suited to constrain important processes
5 | Well‐defined and relatively low uncertainty | Global scale spanning multiple years | Observations well suited for discriminating critical processes among models
  • Note. A score for each data set is assigned in each of three areas. These scores are then combined multiplicatively and used to determine relative importance for a data set with respect to a given variable. ILAMB = International Land Model Benchmarking.
Figure 1. The International Land Model Benchmarking top‐level graphic uses stoplight colors to show how different models or model versions (across the top) score with respect to each variable (down the left) in an absolute sense (left rectangle) and with respect to each other (right rectangle). Gray boxes reflect missing or unavailable data.

We do not view these aggregate absolute scores as a determinant of good or bad models. We envision the scores as a tool to more quickly identify relative differences among models and model versions, which the scientist must then interpret. As in any evaluation methodology, many of our choices are subjective and must be kept in mind as the scores are interpreted. Where possible, the ILAMB implementation allows users to customize weights and diagnostics in order to incorporate aspects of model performance relevant to their scientific goals. ILAMB may be thought of as a framework that can be expanded to incorporate community ideas regarding model benchmarking. Thus, while our choices are subjective, they are informed by the preferences of a larger community and should be considered an initial suggestion.

The remainder of this paper describes the ILAMB methodology used to compute aggregate absolute scores. First we describe how we compare an individual observational data set to model output (section 2). Then we explain how scores are aggregated across data sets for each variable and present the data sets used in the land model evaluation (section 3). In section 4 we present some salient points about how the ILAMB software is designed. Finally, in section 5 we discuss what ILAMB scores mean and how they should be used.

2 Methodology

In this section we describe the methodology used to assess how well a model captures information contained in a reference (e.g., observational) data set. For the purposes of this section, we discuss the analysis of a generalized variable v(t,x), which we assume is piecewise constant in space and time. This means that the temporal domain, represented by the variable t, is defined by the beginnings and endings of time intervals, and the spatial domain, represented by the variable x (in bold to emphasize that it is a vector quantity), consists of the areas created by cell boundaries or the areas associated with data sites. When necessary, we use the subscript ref to denote a variable whose source is a reference or observational data set, and the subscript mod for model data sets.

While many statistical quantities may be computed, the goal of our initial methodology is to examine the mean state and variability around the mean over monthly to decadal time scales and grid cell to global spatial scales. While we intend to uniformly apply this analysis procedure to all variables, we also implement a mechanism to skip certain aspects when deemed inappropriate. For example, if a reference data set only contains average information across a span of years, the annual cycle is undefined and automatically skipped in our implementation. The implementation also allows users to skip aspects of the analysis that are deemed inappropriate even if it is possible to compute these metrics using the available data. For example, the interannual variability may be poorly characterized in a reference data set even though the quantity could be computed.

2.1 Preliminary Definitions

Before presenting the specifics of the ILAMB methodology, we first present some definitions used throughout the paper. While the following definitions are widely used in the community, there are many subtle choices in their implementation that affect the interpretation of the results. We present them here with precise meanings to emphasize where a choice has been made and our reasoning for making it.

2.1.1 Mean Values Over Time

When calculating mean values over the time period of the benchmark data set, denoted by a bar superscribing the variable, we use the midpoint quadrature rule to approximate the integral,
$$\overline{v}(\mathbf{x}) = \frac{1}{T(\mathbf{x})} \int_{t_0}^{t_f} v(t,\mathbf{x})\,dt \approx \frac{1}{T(\mathbf{x})} \sum_{i=1}^{n} v(t_i,\mathbf{x})\,\Delta t_i \quad (1)$$
where n represents the number of time intervals on which v is defined between the initial time, t0, and the final time, tf, and Δti is the size of the ith time interval, modified to exclude time which falls outside of the integral limits,
$$\Delta t_i = \min\!\left(t_i^{1}, t_f\right) - \max\!\left(t_i^{0}, t_0\right), \quad (2)$$
where $t_i^{0}$ and $t_i^{1}$ are the initial and final times of each time interval. The average value is obtained by dividing through by the amount of time in the interval, $t_f - t_0$, replaced in our discrete approximation by the following function:
$$T(\mathbf{x}) = \sum_{\substack{i=1 \\ v(t_i,\mathbf{x})\ \mathrm{valid}}}^{n} \Delta t_i. \quad (3)$$

In words, equation 3 addresses temporally discontinuous data by summing all the time step interval sizes only if the corresponding variable data are marked as valid. This means that if a function has some values masked or marked as invalid at some locations, we do not penalize the averaged value by including this as a time at which a value is expected. If an integral (or sum) is desired instead of an average, then we simply omit the division by T(x) in equation 1.
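To make the handling of temporally discontinuous data concrete, the following is a minimal numpy sketch of equations 1-3; it is an illustration rather than the package's internal routine, and the function and variable names are hypothetical.

```python
import numpy as np

def time_mean(v, dt, valid):
    """Midpoint-rule time mean of a piecewise-constant series (equations 1-3).

    v     : (n,) values on each time interval
    dt    : (n,) interval sizes, already clipped to [t0, tf] (equation 2)
    valid : (n,) boolean mask; invalid intervals are excluded from both the
            numerator and the measure T (equation 3)
    """
    T = np.sum(dt * valid)              # equation 3
    return np.sum(v * dt * valid) / T   # equation 1

# Example: a 12-month record with two missing months
v     = np.array([1.0, 2.0, 3.0, np.nan, 5.0, 6.0, 7.0, 8.0, np.nan, 10.0, 11.0, 12.0])
dt    = np.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31], float)
valid = ~np.isnan(v)
print(time_mean(np.nan_to_num(v), dt, valid))
```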

2.1.2 Mean Values Over Space

When computing spatial means over various regions of interest, denoted by a double bar over a variable, we use the midpoint rule for integration to approximate the following weighted spatial integral,
$$\overline{\overline{v}} = \frac{1}{A(\Omega)} \int_{\Omega} w(\mathbf{x})\, v(\mathbf{x})\, d\Omega \approx \frac{1}{A(\Omega)} \sum_{i=1}^{n(\Omega)} w(\mathbf{x}_i)\, v(\mathbf{x}_i)\, a_i \quad (4)$$
over a region Ω, also referred to as an area‐weighted mean. Here the function w(x) is an optional generic weighting function defined over space. The summation is over n(Ω), that is, the integer number of spatial cells whose centroids fall into the region of interest. A function evaluation at a location xi refers to the constant value which corresponds to that spatial cell. The value of ai is the area of the cell, which could be some fraction of the total cell area if integrating over land in coastal regions. We then divide through by the measure, the weighted sum of the grid cell areas,
$$A(\Omega) = \sum_{i=1}^{n(\Omega)} w(\mathbf{x}_i)\, a_i. \quad (5)$$

Note that if no weighting is required, this is a normalization by the sum of the area over which we integrate. As with the temporal mean, if an integral only is required, we simply omit the division by A(Ω). In cases where a mean over a collection of sites is needed, the spatial integral reduces to an arithmetic mean across the sites.

If we are spatially integrating a variable from a single source, then its spatial grid is clearly defined and equation 4 can be directly applied to compute the quantity of interest. However, if the integrand involves quantities from two different sources, as in computing the global bias or RMSE, then there is likely a disparity in both resolution and representation of land areas. We address resolution differences by interpolating both sources to a grid composed of the cell breaks, the location at which two neighboring cells meet, of both data sources. Consider two spatial grids whose cells are defined by the outer product of 1‐D vectors representing the cell breaks in spherical coordinates,
$$\mathcal{G}_{\mathrm{ref}} = \boldsymbol{\theta}_{\mathrm{ref}} \otimes \boldsymbol{\varphi}_{\mathrm{ref}} \quad (6)$$
$$\mathcal{G}_{\mathrm{mod}} = \boldsymbol{\theta}_{\mathrm{mod}} \otimes \boldsymbol{\varphi}_{\mathrm{mod}} \quad (7)$$
where θ refers to latitude, φ to longitude, and ⊗ is an operator which creates a two‐dimensional grid from one‐dimensional vectors. We address differences in resolution by defining a composite grid, which consists of the outer product of the union of these two grids' cell breaks,
$$\mathcal{G}_{\mathrm{comp}} = \left(\boldsymbol{\theta}_{\mathrm{ref}} \cup \boldsymbol{\theta}_{\mathrm{mod}}\right) \otimes \left(\boldsymbol{\varphi}_{\mathrm{ref}} \cup \boldsymbol{\varphi}_{\mathrm{mod}}\right). \quad (8)$$

Once constructed, quantities defined on both $\mathcal{G}_{\mathrm{ref}}$ and $\mathcal{G}_{\mathrm{mod}}$ may be interpolated to $\mathcal{G}_{\mathrm{comp}}$ by nearest neighbor interpolation with zero interpolation error due to the nested nature of the grids. This can be seen visually by comparing the three plots shown in Figure 2a. In each plot, the tick marks along the x axis represent the cell breaks of the particular one‐dimensional grid, left coarse for illustration. The cyan curve represents a step function defined on the grid of a reference data set $\mathcal{G}_{\mathrm{ref}}$ and the magenta curve one defined on that of the model data set $\mathcal{G}_{\mathrm{mod}}$. Both are interpolated to the composite grid $\mathcal{G}_{\mathrm{comp}}$ without loss of information, albeit on a new grid containing more cells of variable size. Once on a composite grid, the quantities may be compared directly. As the ILAMB methodology has been envisioned for comparisons with model output from CMIP5, we have made an implicit assumption that each source grid, $\mathcal{G}_{\mathrm{ref}}$ and $\mathcal{G}_{\mathrm{mod}}$, is regular and can be represented by one‐dimensional vectors. While the implementation does provide naive interpolation for nonregular grids, the user is encouraged to employ a conservative interpolation scheme of their choosing prior to applying the ILAMB methodology.
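The union-of-cell-breaks idea can be illustrated in one dimension with a short numpy sketch of equation 8 and the nearest neighbor transfer described above; this is not the package implementation, and the helper names are hypothetical.

```python
import numpy as np

def composite_breaks(breaks_a, breaks_b):
    """Union of two sets of 1-D cell breaks (the 1-D analog of equation 8)."""
    return np.unique(np.concatenate([breaks_a, breaks_b]))

def to_composite(values, breaks, comp_breaks):
    """Nearest-neighbor (piecewise-constant) transfer of cell values onto the
    composite grid; exact because every composite cell nests inside a source cell."""
    centers = 0.5 * (comp_breaks[:-1] + comp_breaks[1:])
    idx = np.digitize(centers, breaks) - 1      # source cell containing each center
    idx = np.clip(idx, 0, len(values) - 1)
    return values[idx]

# Reference on a 2-cell grid, model on a 3-cell grid over the same interval
ref_breaks, ref_vals = np.array([0.0, 0.5, 1.0]), np.array([1.0, 3.0])
mod_breaks, mod_vals = np.array([0.0, 1/3, 2/3, 1.0]), np.array([2.0, 2.0, 4.0])
comp  = composite_breaks(ref_breaks, mod_breaks)   # [0, 1/3, 0.5, 2/3, 1]
ref_c = to_composite(ref_vals, ref_breaks, comp)
mod_c = to_composite(mod_vals, mod_breaks, comp)
print(comp, ref_c - mod_c)                         # direct cell-by-cell comparison
```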

Figure 2. When comparing two spatial variables of varying resolution, we interpolate both to a common grid composed of the cell breaks of both variables over the intersection of what both variables agree is land. (a) Interpolation of sample step functions defined on grids $\mathcal{G}_{\mathrm{ref}}$ and $\mathcal{G}_{\mathrm{mod}}$, both interpolated to a composite grid $\mathcal{G}_{\mathrm{comp}}$ using nearest neighbor interpolation with zero interpolation error. The vertical grid lines reflect the cell boundaries in each grid. (b) Differences in the representation of land from a reference and model data set, zoomed into Central America for emphasis. The red region represents where both sources are in agreement, the blue is land for the model but not the reference, and the green is land for the reference but not the model.

In addition to resolution differences, we observe that data sources vary in the underlying representation of the distinction between land and water. We illustrate this concept in Figure 2b, where we compare a fine scale representation of land, $\Omega_{\mathrm{ref}}$, to a relatively coarse representation, $\Omega_{\mathrm{mod}}$. This is a typical situation encountered when comparing high‐resolution observational data to lower‐resolution model output. The red region represents the intersection of land areas, $\Omega_{\mathrm{ref}} \cap \Omega_{\mathrm{mod}}$, that is, where both sources report the presence of land. However, there are missed land areas from both sources, represented by the blue and green colors. As much of the disagreement over what is considered land occurs around islands in tropical regions (e.g., Central America and Equatorial Asia), these nonrepresented areas can constitute a nontrivial percentage of the total represented variable v.

For transparency, the ILAMB implementation is built with the capability of reporting integrals over each of these three land areas. Unless specifically stated otherwise, when spatially integrating a quantity from a single source, we use the original grid and land areas given by that source. This is to remain as true to the original intent of the provider as we can. However, when comparing two data sources of varying resolution and land representation, we perform this integration over what both report to be land, $\Omega_{\mathrm{ref}} \cap \Omega_{\mathrm{mod}}$ (the red area in Figure 2b).

2.1.3 Computing Normalized Scores From Errors

In sections 2.2 and 2.3, we detail how we compute errors and transform them into normalized scores on the unit interval. This approach is intended to synthesize model performance across a range of dimensions with respect to a given data set. We achieve this by taking a measure of the relative error, generically represented here as ϵ, and passing it through the exponential function,
$$s = e^{-\alpha\,\epsilon}, \quad (9)$$
where s is a score on the interval [0,1] and α is a parameter which can be used to tune the mapping of error to score. The classic expression of relative error is prone to numerical instabilities when denominator values approach or cross zero. Furthermore, the magnitude of the error can depend on the units selected. For this reason we depart from the standard definition of relative error and develop specialized expressions in equations 13, 18, and 26.
While the choice of the exponential function is arbitrary, it was chosen because it maps zero error to a score of one and smoothly reduces the score as the error grows, never reaching exactly zero. This is important because we want the score to improve whenever the error improves, no matter how large an error we observe. If the user wants a relative error of $\epsilon^{*}$ to equate to a score of $s^{*}$, then
$$\alpha = -\frac{\ln s^{*}}{\epsilon^{*}}. \quad (10)$$

In Figure 3 we plot this function with two choices for α, illustrating how the mapping of relative error to score may be controlled. Unless stated otherwise, we use α = 1 throughout the manuscript.
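A minimal sketch of equations 9 and 10, assuming only numpy; the function names are illustrative.

```python
import numpy as np

def score(eps, alpha=1.0):
    """Map a nonnegative relative error to a score in (0, 1] (equation 9)."""
    return np.exp(-alpha * eps)

def alpha_for(eps_star, s_star):
    """Choose alpha so that relative error eps_star maps to score s_star (equation 10)."""
    return -np.log(s_star) / eps_star

print(score(0.5))                        # ~0.61 with the default alpha = 1
print(score(1.0, alpha_for(1.0, 0.1)))   # 0.1: a 100% relative error scores 0.1
```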

Figure 3. Mapping function of relative error ϵ to a score s on the unit interval. Two choices of α are shown: α = 1, shown in blue, which equates a score of 0.6 to a relative error of 50%; and α = 2.3, shown in orange, which equates a score of 0.1 to a relative error of 100%.

2.2 Mean State Analysis

In this section, we describe the various metrics and plots that our methodology generates. While presented in terms of the abstract variable v, we also include sample plots of a comparison of the GBAF (Jung et al., 2010) gross primary productivity (GPP) with CLM4.5 (Oleson et al., 2013) for the purpose of illustration. In practice, ILAMB produces thousands of such plots and scalars, which are browsable in a website designed to aid modelers in understanding the benchmarking results.

2.2.1 Bias

We find the mean value in time, $\overline{v}_{\mathrm{ref}}(\mathbf{x})$, over the time period of the reference, as well as that of the model, $\overline{v}_{\mathrm{mod}}(\mathbf{x})$, over the same time period. These are spatial variables that are included in the standard output as plots, as shown in Figures 4a and 4b. We also compute the bias,
$$\mathrm{bias}(\mathbf{x}) = \overline{v}_{\mathrm{mod}}(\mathbf{x}) - \overline{v}_{\mathrm{ref}}(\mathbf{x}), \quad (11)$$
as well as its mean over a given region, $\overline{\overline{\mathrm{bias}}}$. To score the bias, we need to nondimensionalize it as a relative error. We have chosen to do this by using the centralized RMS of the reference data,
$$\mathrm{crms}_{\mathrm{ref}}(\mathbf{x}) = \sqrt{\frac{1}{T(\mathbf{x})} \int_{t_0}^{t_f} \left( v_{\mathrm{ref}}(t,\mathbf{x}) - \overline{v}_{\mathrm{ref}}(\mathbf{x}) \right)^2 dt}, \quad (12)$$
which makes the relative error in bias given as
$$\epsilon_{\mathrm{bias}}(\mathbf{x}) = \frac{\left| \mathrm{bias}(\mathbf{x}) \right|}{\mathrm{crms}_{\mathrm{ref}}(\mathbf{x})}, \quad (13)$$
where |·| represents the absolute value. The bias score as a function of space is
$$s_{\mathrm{bias}}(\mathbf{x}) = e^{-\epsilon_{\mathrm{bias}}(\mathbf{x})}, \quad (14)$$
and the scalar score
$$S_{\mathrm{bias}} = \overline{\overline{s_{\mathrm{bias}}}}, \quad (15)$$
that is, the spatially integrated bias score. The motivation behind equation 13 is to normalize the bias by the variability at any given spatial location. However, this also leads to the consequence that in areas where the given variable v has a small magnitude, simple noise can lead to large relative errors. For example, in Figure 4d we observe a poor score in the dry regions of Australia where GPP is small. Given the small contribution of these regions, it is undesirable that such errors induce a large negative contribution to the overall score. To address this issue, we introduce the concept of mass weighting. That is, when performing the spatial integral to obtain a scalar score (equation 15), we weight the integral by the period mean value of the reference variable, using equation 4 with $w(\mathbf{x}) = \overline{v}_{\mathrm{ref}}(\mathbf{x})$. In some instances the variable is truly a mass, but other times a flux or rate. The main motivation is to give more weight to areas where the variable is active. So while in our conceptual example there is large relative error in GPP over deserts, these values will not strongly degrade the overall score because the value of GPP is low in these areas.
Figure 4. Comparisons of gross primary productivity between the reference (GBAF) and the model (CLM4.5) data sets. Each period mean is plotted over the original grid of the data set. We highlight here that the reference (a) is not defined over Antarctica, Greenland, and part of the Sahara desert, whereas the model (b) is defined over all land areas. Yet when the bias (c) and its score (d) are reported, the area represented is what both the reference and model agree on as land. (a) Reference period mean, $\overline{v}_{\mathrm{ref}}(\mathbf{x})$; (b) model period mean, $\overline{v}_{\mathrm{mod}}(\mathbf{x})$; (c) bias, bias(x); (d) bias score, $s_{\mathrm{bias}}(\mathbf{x})$.

We apply mass weighting when the variable v represents a mass or flux of carbon or water as in GPP or precipitation. For variables representing energy states or quantities, such as temperature and radiation, we omit the weighting and perform a spatial integral only. We report plots of the bias and its score as well as the scalar integrated mean values.
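As a concrete illustration of equations 11-15, the following sketch computes a mass-weighted bias score on a shared grid, assuming equal-length time intervals with no missing data; it is not the package implementation and the names are illustrative.

```python
import numpy as np

def bias_score(vref, vmod, area, mass_weight=True):
    """Bias score for gridded monthly data (equations 11-15).

    vref, vmod : (nt, ncell) time series on a shared composite grid
    area       : (ncell,) cell areas over the land both sources agree on
    """
    ref_mean = vref.mean(axis=0)                           # period means
    mod_mean = vmod.mean(axis=0)
    bias = mod_mean - ref_mean                             # equation 11
    crms = np.sqrt(((vref - ref_mean) ** 2).mean(axis=0))  # equation 12
    eps  = np.abs(bias) / crms                             # equation 13
    s    = np.exp(-eps)                                    # equation 14
    w    = area * (ref_mean if mass_weight else 1.0)       # weight active regions
    return np.sum(s * w) / np.sum(w)                       # equation 15

# Toy example: 24 months over 4 cells
rng  = np.random.default_rng(0)
vref = rng.random((24, 4))
vmod = vref + 0.1 * rng.standard_normal((24, 4))
print(bias_score(vref, vmod, area=np.ones(4)))
```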

2.2.2 RMSE

For reference data sets with seasonal and interannual variability, we compute the RMSE over the time period of the reference data set,
$$\mathrm{rmse}(\mathbf{x}) = \sqrt{\frac{1}{T(\mathbf{x})} \int_{t_0}^{t_f} \left( v_{\mathrm{mod}}(t,\mathbf{x}) - v_{\mathrm{ref}}(t,\mathbf{x}) \right)^2 dt}, \quad (16)$$
and include plots and the scalar $\overline{\overline{\mathrm{rmse}}}$ in the standard output (Figure 5a). To score the RMSE, we normalize the centralized RMSE,
$$\mathrm{crmse}(\mathbf{x}) = \sqrt{\frac{1}{T(\mathbf{x})} \int_{t_0}^{t_f} \left( \left( v_{\mathrm{mod}}(t,\mathbf{x}) - \overline{v}_{\mathrm{mod}}(\mathbf{x}) \right) - \left( v_{\mathrm{ref}}(t,\mathbf{x}) - \overline{v}_{\mathrm{ref}}(\mathbf{x}) \right) \right)^2 dt}, \quad (17)$$
by the centralized RMS of the reference data set, equation 12. This leads to a relative error of
$$\epsilon_{\mathrm{rmse}}(\mathbf{x}) = \frac{\mathrm{crmse}(\mathbf{x})}{\mathrm{crms}_{\mathrm{ref}}(\mathbf{x})}, \quad (18)$$
and a spatial RMSE score
$$s_{\mathrm{rmse}}(\mathbf{x}) = e^{-\epsilon_{\mathrm{rmse}}(\mathbf{x})}. \quad (19)$$
Figure 5. Comparisons of the root‐mean‐square error (RMSE) and phase of gross primary productivity between the reference (GBAF) and the model (CLM4.5) data sets. (a) RMSE, rmse(x); (b) RMSE score, $s_{\mathrm{rmse}}(\mathbf{x})$; (c) phase shift, θ(x); (d) phase shift score, $s_{\mathrm{cycle}}(\mathbf{x})$.
The scalar score is obtained by
$$S_{\mathrm{rmse}} = \overline{\overline{s_{\mathrm{rmse}}}}, \quad (20)$$
where we again employ mass weighting when necessary. We score the centralized RMSE to decouple the bias score from the RMSE score. Computing the RMSE score by normalizing the RMSE would lead to a double counting of errors. That is, a large error in bias also leads to a large error in RMSE. By scoring the centralized RMSE, we remove the bias from the RMSE, allowing the RMSE score to focus on an orthogonal aspect of model performance.
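A corresponding sketch for equations 16-20, under the same simplifying assumptions as above (equal intervals, no gaps), shows how the centralized RMSE decouples the bias from the RMSE score; the names are illustrative.

```python
import numpy as np

def rmse_scores(vref, vmod, area, mass_weight=True):
    """RMSE map and scalar RMSE score per equations 16-20 (equal intervals, no gaps)."""
    rmse = np.sqrt(((vmod - vref) ** 2).mean(axis=0))              # equation 16
    anom_ref = vref - vref.mean(axis=0)
    anom_mod = vmod - vmod.mean(axis=0)
    crmse = np.sqrt(((anom_mod - anom_ref) ** 2).mean(axis=0))     # equation 17
    crms  = np.sqrt((anom_ref ** 2).mean(axis=0))                  # equation 12
    s = np.exp(-crmse / crms)                                      # equations 18-19
    w = area * (vref.mean(axis=0) if mass_weight else 1.0)
    return rmse, np.sum(s * w) / np.sum(w)                         # equation 20
```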

2.2.3 Phase Shift

We evaluate the phase shift of the annual cycle for the many data sets that have monthly variability by comparing the timing of the maximum of the annual cycle of the variable, c(v), at each spatial cell across the time period of the reference data set. We then approximate the phase shift between the reference and model data sets by subtracting these two timings,
$$\theta(\mathbf{x}) = t_{\max}\!\left(c(v_{\mathrm{mod}})\right) - t_{\max}\!\left(c(v_{\mathrm{ref}})\right), \quad (21)$$
expressed in days. As the units for phase shift are consistent across all variables, no normalization is needed and we can remap the shift to the unit interval by
$$s_{\mathrm{cycle}}(\mathbf{x}) = \frac{1}{2}\left(1 + \cos\!\left(\frac{2\pi\,\theta(\mathbf{x})}{365}\right)\right), \quad (22)$$
and then spatially integrate the score over the appropriate region to find the scalar score,
$$S_{\mathrm{cycle}} = \overline{\overline{s_{\mathrm{cycle}}}}, \quad (23)$$
where again mass weighting is employed when appropriate. We include plots of the phase shift and its score in the standard output and represent them here in Figures 5c and 5d. In addition to plots which show the time averaged variables as a map, we include line plots of the mean annual cycle and the spatially averaged variables, $\overline{\overline{v}}_{\mathrm{ref}}(t)$ and $\overline{\overline{v}}_{\mathrm{mod}}(t)$, shown in Figure 6.
Figure 6. Spatial means of gross primary productivity of the reference (GBAF), shown in gray, and the model (CLM4.5), in maroon. (a) Spatially integrated mean, $\overline{\overline{v}}_{\mathrm{ref}}(t)$ and $\overline{\overline{v}}_{\mathrm{mod}}(t)$; (b) mean annual cycle, $c(v_{\mathrm{ref}})$ and $c(v_{\mathrm{mod}})$.
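A minimal sketch of the phase shift and its score (equations 21 and 22) for monthly data at a single location, assuming the annual cycle is taken as the mean over years and the shift is expressed via mid-month days; the names are illustrative.

```python
import numpy as np

def cycle_score(vref, vmod):
    """Seasonal-cycle (phase) score from monthly data at one location (equations 21-22).

    vref, vmod : (nyears, 12) monthly values; the annual cycle is the mean over years.
    """
    doy = np.array([15, 46, 74, 105, 135, 166, 196, 227, 258, 288, 319, 349])  # mid-month days
    tmax_ref = doy[np.argmax(vref.mean(axis=0))]
    tmax_mod = doy[np.argmax(vmod.mean(axis=0))]
    theta = tmax_mod - tmax_ref                                      # shift in days, equation 21
    return theta, 0.5 * (1.0 + np.cos(2.0 * np.pi * theta / 365.0))  # equation 22
```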

2.2.4 Interannual Variability

A score for the interannual variability is computed by removing the annual cycle from both the reference and the model,
$$\mathrm{iav}_{\mathrm{ref}}(\mathbf{x}) = \sqrt{\frac{1}{T(\mathbf{x})} \int_{t_0}^{t_f} \left( v_{\mathrm{ref}}(t,\mathbf{x}) - c(v_{\mathrm{ref}})(t,\mathbf{x}) \right)^2 dt}, \quad (24)$$
$$\mathrm{iav}_{\mathrm{mod}}(\mathbf{x}) = \sqrt{\frac{1}{T(\mathbf{x})} \int_{t_0}^{t_f} \left( v_{\mathrm{mod}}(t,\mathbf{x}) - c(v_{\mathrm{mod}})(t,\mathbf{x}) \right)^2 dt}, \quad (25)$$
$$\epsilon_{\mathrm{iav}}(\mathbf{x}) = \frac{\left| \mathrm{iav}_{\mathrm{mod}}(\mathbf{x}) - \mathrm{iav}_{\mathrm{ref}}(\mathbf{x}) \right|}{\mathrm{iav}_{\mathrm{ref}}(\mathbf{x})}, \quad (26)$$
and then computing a score as a function of space,
$$s_{\mathrm{iav}}(\mathbf{x}) = e^{-\epsilon_{\mathrm{iav}}(\mathbf{x})}. \quad (27)$$
The scalar score is then obtained by
$$S_{\mathrm{iav}} = \overline{\overline{s_{\mathrm{iav}}}}, \quad (28)$$
where mass weighting is used when necessary. We include plots of the variability and the score in the standard output and show them here in Figure 7. Note that while here we have shown the interannual variability of the GBAF product for illustration, in the default ILAMB configuration, the interannual variability is currently omitted for the GBAF products because its representativeness is considered to be poor (see Figure 10 of Kumar et al., 2016).
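The interannual variability score (equations 24-27) at a single location can be sketched similarly, again assuming complete monthly data; the names are illustrative.

```python
import numpy as np

def iav_score(vref, vmod):
    """Interannual variability score at one location (equations 24-27).

    vref, vmod : (nyears, 12) monthly values; removing the mean annual cycle
    leaves the interannual anomalies.
    """
    iav_ref = np.sqrt(((vref - vref.mean(axis=0)) ** 2).mean())  # equation 24
    iav_mod = np.sqrt(((vmod - vmod.mean(axis=0)) ** 2).mean())  # equation 25
    eps = np.abs(iav_mod - iav_ref) / iav_ref                    # equation 26
    return np.exp(-eps)                                          # equation 27
```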
Figure 7. Comparisons of the interannual variability of gross primary productivity between the reference (GBAF) and the model (CLM4.5) data sets. (a) Reference interannual variability, $\mathrm{iav}_{\mathrm{ref}}(\mathbf{x})$; (b) model interannual variability, $\mathrm{iav}_{\mathrm{mod}}(\mathbf{x})$; (c) interannual variability score, $s_{\mathrm{iav}}(\mathbf{x})$.

2.2.5 Spatial Distribution

We score the spatial distribution of the time averaged variable by generating a Taylor diagram (Taylor, 2001). We do this by computing the normalized standard deviation,
$$\sigma = \frac{\sigma_{\mathrm{mod}}}{\sigma_{\mathrm{ref}}}, \quad (29)$$
and the spatial correlation R of the period mean values $\overline{v}_{\mathrm{ref}}(\mathbf{x})$ and $\overline{v}_{\mathrm{mod}}(\mathbf{x})$, and then assigning a score by the following relationship,
$$S_{\mathrm{dist}} = \frac{2\,(1 + R)}{\left( \sigma + \frac{1}{\sigma} \right)^{2}}, \quad (30)$$
where the main idea is that we penalize the score when R and σ deviate from a value of 1. We include the Taylor plot in the standard output and represent it here in Figure 8.
Figure 8. Taylor diagram comparing the spatial distribution of gross primary productivity of the reference (GBAF), shown as a black star, to the CMIP5 models, shown in colors.
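A minimal, unweighted sketch of equations 29 and 30 over flattened period-mean maps; cell-area weighting is omitted here for brevity, and the names are illustrative.

```python
import numpy as np

def spatial_distribution_score(ref_mean, mod_mean):
    """Taylor-diagram-based score of the period-mean maps (equations 29-30)."""
    sigma = np.std(mod_mean) / np.std(ref_mean)           # normalized std, equation 29
    R = np.corrcoef(ref_mean, mod_mean)[0, 1]             # spatial correlation
    return 2.0 * (1.0 + R) / (sigma + 1.0 / sigma) ** 2   # equation 30
```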

2.2.6 Overall Score

The overall score for a given variable and data product is a composite of the suite of metrics defined above. We use a weighted sum,
$$S_{\mathrm{overall}} = \frac{S_{\mathrm{bias}} + 2\,S_{\mathrm{rmse}} + S_{\mathrm{cycle}} + S_{\mathrm{iav}} + S_{\mathrm{dist}}}{1 + 2 + 1 + 1 + 1}, \quad (31)$$
where the RMSE score is doubly weighted to emphasize its importance.

2.3 Relationship Analysis

As models are frequently calibrated using the mean state scalar measures described in section 2.2, a higher score does not necessarily reflect a more process‐oriented model. In order to assess the representation of mechanistic processes in models, we also evaluate variable‐to‐variable relationships. For example, we look at how well models represent the relationship that GPP has with precipitation, evapotranspiration, and temperature. For the purposes of this section, we represent a generic dependent variable as v, as before, and score its relationship with an independent variable u. We then compare the variable‐to‐variable relationship of the time period means, $\overline{v}$ on $\overline{u}$, derived from the combination of reference data sets, to the relationship diagnosed in models. We use the mean values over the reference time period to establish relationships as they represent a logical starting point. In the future, we plan to extend the relationship analysis to include seasonal and interannual variability.

2.3.1 Functional Response

We estimate a functional response by a 1‐D histogram, binned in terms of the independent variable $\overline{u}$ with a number of bins, initially set to nbins = 25. Then in each bin, we compute the mean value of the corresponding dependent variable, $\overline{v}$, to approximate the functional dependence of v on u. We represent this binning with the operator $\mathcal{B}(\cdot \mid \cdot)$, which operates on the dependent and independent variables. We use it to compute functions from both the reference and model data sets,
$$f_{\mathrm{ref}}(u) = \mathcal{B}\!\left(\overline{v}_{\mathrm{ref}} \mid \overline{u}_{\mathrm{ref}}\right), \quad (32)$$
$$f_{\mathrm{mod}}(u) = \mathcal{B}\!\left(\overline{v}_{\mathrm{mod}} \mid \overline{u}_{\mathrm{mod}}\right), \quad (33)$$
where both curves are plotted in Figure 9a for the case of GPP compared to surface air temperature. These response curves are then scored by computing a relative error based on the RMSE,
$$\epsilon_{\mathrm{func}} = \frac{\sqrt{\int \left( f_{\mathrm{mod}}(u) - f_{\mathrm{ref}}(u) \right)^2 du}}{\sqrt{\int f_{\mathrm{ref}}(u)^2\, du}}, \quad (34)$$
where the integrals are approximated by the midpoint rule over the bins of the independent variable $\overline{u}$. Then we use equation 9 to map this relative error to a score by
$$s^{u} = e^{-\epsilon_{\mathrm{func}}}. \quad (35)$$
Figure 9. Variable‐to‐variable relationship plots which are a part of the standard output from the International Land Model Benchmarking methodology. (a) Functional responses, the reference $f_{\mathrm{ref}}(u)$ in black and the model $f_{\mathrm{mod}}(u)$ in maroon. Data points reflect the mean for each independent‐variable bin, and the error bars reflect the standard deviation range. (b) Reference distribution, $d_{\mathrm{ref}}(u)$; (c) model distribution, $d_{\mathrm{mod}}(u)$.

The superscript u reinforces that this score represents functional performance with respect to a given independent variable u. The ILAMB implementation allows for any number of independent variables to be studied. In terms of our sample, ILAMB scores the functional relationship of GPP with respect to each independent variable separately (precipitation, evapotranspiration, temperature, etc.) and then computes the mean of these scores for the overall relationship score.
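A sketch of the binning operator and the response-curve score (equations 32-35) using numpy; bin placement and the handling of empty bins are illustrative choices, not the package's, and the names are hypothetical.

```python
import numpy as np

def functional_response(u, v, bins):
    """Mean of the dependent variable v in bins of the independent variable u
    (the binning operator behind equations 32-33)."""
    idx = np.digitize(u, bins) - 1
    idx = np.clip(idx, 0, len(bins) - 2)
    return np.array([v[idx == i].mean() if np.any(idx == i) else np.nan
                     for i in range(len(bins) - 1)])

def response_score(u_ref, v_ref, u_mod, v_mod, nbins=25):
    """RMSE-based relative error between response curves, mapped to a score
    (equations 34-35); bins span the range of the reference independent variable."""
    bins  = np.linspace(u_ref.min(), u_ref.max(), nbins + 1)
    f_ref = functional_response(u_ref, v_ref, bins)
    f_mod = functional_response(u_mod, v_mod, bins)
    ok    = ~np.isnan(f_ref) & ~np.isnan(f_mod)
    eps   = np.sqrt(np.sum((f_mod[ok] - f_ref[ok]) ** 2) / np.sum(f_ref[ok] ** 2))
    return np.exp(-eps)
```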

2.3.2 Hellinger Distance

In addition to the one‐dimensional histograms, we also build normalized two‐dimensional histograms (nbins = 25 in both dimensions) from the time averaged data $\overline{u}$ and $\overline{v}$, represented here by the operator $\mathcal{D}(\cdot,\cdot)$. We represent these distributions by
$$d_{\mathrm{ref}} = \mathcal{D}\!\left(\overline{u}_{\mathrm{ref}}, \overline{v}_{\mathrm{ref}}\right), \quad (36)$$
$$d_{\mathrm{mod}} = \mathcal{D}\!\left(\overline{u}_{\mathrm{mod}}, \overline{v}_{\mathrm{mod}}\right), \quad (37)$$
as depicted in Figures 9b and 9c. If we represent individual elements from these distributions by $d_{\mathrm{ref},ij}$ and $d_{\mathrm{mod},ij}$, we can compute the so‐called Hellinger distance (Law et al., 2015),
$$D = \frac{1}{\sqrt{2}} \sqrt{\sum_{i,j} \left( \sqrt{d_{\mathrm{ref},ij}} - \sqrt{d_{\mathrm{mod},ij}} \right)^{2}}, \quad (38)$$
as a measure of how similar two distributions are to each other. While there are other choices, such as the Kullback‐Leibler divergence, which are more commonly employed (Dirmeyer et al., 2014), the Hellinger distance comes with the added benefit of being already normalized to the interval [0,1] and, thus, further normalization is not necessary to use it directly as a score.
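A sketch of equations 36-38 using numpy's 2-D histogram; the shared histogram range is an illustrative choice and the names are hypothetical.

```python
import numpy as np

def hellinger_distance(u_ref, v_ref, u_mod, v_mod, nbins=25):
    """Hellinger distance between the normalized 2-D distributions of
    (independent, dependent) period means (equations 36-38)."""
    rng = [[min(u_ref.min(), u_mod.min()), max(u_ref.max(), u_mod.max())],
           [min(v_ref.min(), v_mod.min()), max(v_ref.max(), v_mod.max())]]
    d_ref, _, _ = np.histogram2d(u_ref, v_ref, bins=nbins, range=rng)
    d_mod, _, _ = np.histogram2d(u_mod, v_mod, bins=nbins, range=rng)
    d_ref /= d_ref.sum()                                   # normalize to unit mass
    d_mod /= d_mod.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(d_ref) - np.sqrt(d_mod)) ** 2))
```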

However, we only report the Hellinger distance as a scalar and do not include it in the scoring of the relationships. This is due to the fact that a bias in an independent variable can cause a density shift in the 2‐D distribution that would cause the score to unreasonably decrease. In terms of our example, a bias in precipitation (e.g., arising from a coupled model) could result in a poor relationship score with GPP, even if there is no underlying deficiency in the land model's simulated relationship between precipitation and GPP.

3 Data Sets

In this section we explain how we utilize the methodology presented in section 2 to evaluate model performance with respect to a collection of data sets (Tables 2– 5) assembled by the ILAMB community. Errors in measurements, lack of measured or reported uncertainties, and inconsistencies in measurement methodology or instrumentation leading to ambiguous confidence in derived or synthesized data products all represent challenges in using observational data for benchmarking. In addition, the spatial and temporal coverage of different data products can vary substantially.

Table 2. References and Weighting of Data Sets Used to Measure the Ecosystem and Carbon Cycle
Variable/Data Set | Certainty | Scale | Process
Biomass | | | 5
Tropical (Saatchi et al., 2011) | 4 | 4 |
NBCD2000 (Kellndorfer et al., 2013) | 4 | 2 |
USForest (Blackard et al., 2008) | 4 | 2 |
Burned area | | | 4
GFED4S (Giglio et al., 2010) | 4 | 5 |
Gross primary productivity | | | 5
Fluxnet (Lasslop et al., 2010) | 3 | 3 |
GBAF (Jung et al., 2010) | 3 | 5 |
Leaf area index | | | 3
AVHRR (Myneni et al., 1997) | 3 | 5 |
MODIS (De Kauwe et al., 2011) | 3 | 5 |
Global net ecosystem carbon balance | | | 5
GCP (Le Quéré et al., 2016) | 4 | 5 |
Hoffman (Hoffman et al., 2014) | 4 | 5 |
Net ecosystem exchange | | | 5
Fluxnet (Lasslop et al., 2010) | 3 | 3 |
GBAF (Jung et al., 2010) | 2 | 2 |
Ecosystem respiration | | | 4
Fluxnet (Lasslop et al., 2010) | 2 | 3 |
GBAF (Jung et al., 2010) | 2 | 2 |
Soil carbon | | | 5
HWSD (Todd‐Brown et al., 2013) | 3 | 5 |
NCSCDV22 (Hugelius et al., 2013) | 3 | 4 |
  • Note. Weights are chosen using the rubric in Table 1 and reflect a focus on understanding the carbon cycle.
Table 3. References and Weighting of Data Sets Used to Measure the Hydrology Cycle
Variable/Data Set | Certainty | Scale | Process
Evapotranspiration | | | 5
GLEAM (Miralles et al., 2011) | 3 | 5 |
MODIS (De Kauwe et al., 2011) | 3 | 5 |
Evaporative fraction | | | 5
GBAF (Jung et al., 2010) | 3 | 3 |
Latent heat | | | 5
Fluxnet (Lasslop et al., 2010) | 3 | 1 |
GBAF (Jung et al., 2010) | 3 | 3 |
Runoff | | | 5
Dai (Dai & Trenberth, 2002) | 3 | 5 |
Sensible heat | | | 2
Fluxnet (Lasslop et al., 2010) | 3 | 3 |
GBAF (Jung et al., 2010) | 3 | 5 |
Terrestrial water storage anomaly | | | 5
GRACE (Swenson & Wahr, 2006) | 5 | 5 |
  • Note. Weights are chosen using the rubric in Table 1 and reflect a focus on understanding the carbon cycle.
Table 4. References and Weighting of Data Sets Used to Measure the Radiation and Energy Cycle
Variable/Data Set | Certainty | Scale | Process
Albedo | | | 1
CERES (Kato et al., 2013) | 4 | 5 |
GEWEX.SRB (Stackhouse et al., 2011) | 4 | 5 |
MODIS (De Kauwe et al., 2011) | 4 | 5 |
Surface upward SW radiation | | | 1
CERES (Kato et al., 2013) | 4 | 4 |
GEWEX.SRB (Stackhouse et al., 2011) | 4 | 5 |
WRMC.BSRN (König‐Langlo et al., 2013) | 4 | 3 |
Surface net SW radiation | | | 1
CERES (Kato et al., 2013) | 4 | 5 |
GEWEX.SRB (Stackhouse et al., 2011) | 4 | 5 |
WRMC.BSRN (König‐Langlo et al., 2013) | 4 | 3 |
Surface upward LW radiation | | | 1
CERES (Kato et al., 2013) | 4 | 5 |
GEWEX.SRB (Stackhouse et al., 2011) | 4 | 5 |
WRMC.BSRN (König‐Langlo et al., 2013) | 4 | 3 |
Surface net LW radiation | | | 1
CERES (Kato et al., 2013) | 4 | 5 |
GEWEX.SRB (Stackhouse et al., 2011) | 4 | 5 |
WRMC.BSRN (König‐Langlo et al., 2013) | 4 | 3 |
Surface net radiation | | | 2
CERES (Kato et al., 2013) | 4 | 5 |
Fluxnet (Lasslop et al., 2010) | 4 | 3 |
GEWEX.SRB (Stackhouse et al., 2011) | 4 | 5 |
WRMC.BSRN (König‐Langlo et al., 2013) | 4 | 3 |
  • Note. Weights are chosen using the rubric in Table 1 and reflect a focus on understanding the carbon cycle.
Table 5. References and Weighting of Data Sets Used to Measure the Forcings
Variable/Data Set | Certainty | Scale | Process
Surface air temperature | | | 2
CRU (Harris et al., 2014) | 5 | 5 |
Fluxnet (Lasslop et al., 2010) | 3 | 3 |
Precipitation | | | 2
CMAP (Xie & Arkin, 1997) | 4 | 5 |
Fluxnet (Lasslop et al., 2010) | 3 | 3 |
GPCC (Schneider et al., 2014) | 4 | 5 |
GPCP2 (Adler et al., 2012) | 4 | 5 |
Surface relative humidity | | | 3
ERA (Dee et al., 2011) | 2 | 5 |
Surface downward SW radiation | | | 2
CERES (Kato et al., 2013) | 4 | 5 |
Fluxnet (Lasslop et al., 2010) | 4 | 3 |
GEWEX.SRB (Stackhouse et al., 2011) | 4 | 5 |
WRMC.BSRN (König‐Langlo et al., 2013) | 4 | 3 |
Surface downward LW radiation | | | 1
CERES (Kato et al., 2013) | 4 | 5 |
GEWEX.SRB (Stackhouse et al., 2011) | 4 | 5 |
WRMC.BSRN (König‐Langlo et al., 2013) | 4 | 3 |
  • Note. Weights are chosen using the rubric in Table 1 and reflect a focus on understanding the carbon cycle.
To account for the lack of quantitative uncertainties and scale mismatches between observations and models, and to bring a quantitative objectivity to model‐data comparison, we developed a three‐element rubric for weighting data sets as represented in Table 1. The first weight is based on a qualitative estimate of the certainty we have in a particular data set. This weight encompasses both our certainty in the process used to obtain the observational information as well as the presence of quantitative uncertainty in the measurements themselves. A second weight for each data set reflects its spatial and temporal coverage. The data sets employed in ILAMB are diverse and include site level data, reanalysis data products, and remotely sensed data. As our aim is to provide insight in land model performance on global and decadal scales, we give more weight to global products, which are time series that extend for several years. The weights are combined multiplicatively to assign a total weight for each data set. Then we normalize the weight by the sum of the weights of all the data sets for a given variable. For example, from Table 2 we see that there are two data sets used to benchmark GPP: Fluxnet and GBAF. For the Fluxnet product, we assign a certainty weight of 3 because while the collection is discussed in the published literature, there is no quantitative uncertainty provided. We assign a scale weight of 3 because the collection of sites covers multiple years of a substantial region of the globe yet has sparse coverage over important regions such as the tropics. The GBAF product is assigned a certainty weight of 3 for the same reason and a scale weight of 5 as it provides global coverage spanning multiple years. Then the total weight for the GPP variable which the GBAF data set carries is
$$w_{\mathrm{GBAF}} = \frac{3 \times 5}{(3 \times 3) + (3 \times 5)} = \frac{15}{24} \approx 0.63.$$
We use these weights to blend the overall score (equation 31) from each data set for each variable. In this way ILAMB remains flexible to adding data sets as they are developed, allowing more weight to be given to those that the community believes are more credible and that are more comparable in scale to models.
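The weight calculation above can be written out in a few lines; the numbers come from Table 2.

```python
# Certainty and scale weights for the two GPP data sets from Table 2,
# combined multiplicatively and then normalized across the variable.
weights = {"Fluxnet": 3 * 3, "GBAF": 3 * 5}
total = sum(weights.values())
normalized = {name: w / total for name, w in weights.items()}
print(normalized)   # {'Fluxnet': 0.375, 'GBAF': 0.625}
```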

A third weight reflects how useful the measured variable is in the focus of a model intercomparison project (MIP). Here, as an example, we show weighting for an analysis of model performance in representing the carbon cycle. We use these weights to blend the overall scores from each variable into a complete score across all variables for a given model. This allows ILAMB to include comparisons that are important for a complete understanding of the carbon cycle without necessarily allowing them to heavily influence the overall score. For example, the radiation and energy cycle data sets in Table 4 are all weighted comparatively low because, while they help one understand the carbon cycle, they are not as influential in the overall behavior.

We emphasize that this rubric is particular to our overarching goal of understanding the carbon cycle on global and decadal scales. However, the implementation is flexible and allows for an arbitrary weighting scheme to be developed that suits the needs of the user, community, or MIP that it serves.

The references and weights for each data set that we have selected may be found in Tables 2-5. Each table represents a different aspect of the model: the ecosystem and carbon cycle in Table 2, the hydrological cycle in Table 3, the radiation and energy cycle in Table 4, and the forcings in Table 5. For the majority of these data sets, we make a direct comparison of the observed quantity to model outputs, or algebraic combinations of model outputs using the methodology described in section 2. However, there are a few special cases which require specific handling which we describe in the next section.

3.1 Special Cases

In general, a consistent methodology is applied to compare model output with each data set. This consistency across variables and data sets is a strength of the ILAMB methodology. However, this is not always possible, and here we enumerate a few exceptions and how they are handled.

3.1.1 Evaporative Fraction

To test the partitioning of surface energy, we compare the evaporative fraction derived from the GBAF (Jung et al., 2010) data product to that of the models. The evaporative fraction is an algebraic expression in terms of the latent heat Le(t,x) and the sensible heat Sh(t,x), given as
$$\mathit{ef}(t,\mathbf{x}) = \frac{\mathit{Le}(t,\mathbf{x})}{\mathit{Le}(t,\mathbf{x}) + \mathit{Sh}(t,\mathbf{x})}. \quad (39)$$

The expression can cause nonsensical results because in winter the sensible heat flux can be negative, leading to a change of sign in the evaporative fraction. The expression can also lead to large evaporative fraction values since the magnitudes of both the latent and sensible heat can become small. For this reason, we apply a mask to ef, Le, and Sh, considering only values for which Sh > 0, Le > 0, and Sh + Le > φ, where φ = 20 W/m2 is a surface energy threshold.

Equation 39 is used to study how models partition the surface energy throughout the relevant season. Thus, we use that expression when computing the RMSE or seasonal cycle. However, when comparing period mean values and the bias, equation 39 leads to a combination of averaging methods. For this reason, when computing the mean evaporative fraction over time and the bias, we use a ratio of means in place of the mean of the ratio,
$$\overline{\mathit{ef}}(\mathbf{x}) = \frac{\overline{\mathit{Le}}(\mathbf{x})}{\overline{\mathit{Le}}(\mathbf{x}) + \overline{\mathit{Sh}}(\mathbf{x})}. \quad (40)$$

Beyond this change, the evaporative fraction is evaluated using the methodology defined in section 2.
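A minimal sketch of the masked evaporative fraction and its ratio-of-means period mean (equations 39 and 40), assuming simple time series of latent and sensible heat; the names are illustrative.

```python
import numpy as np

def evaporative_fraction(le, sh, phi=20.0):
    """Masked evaporative fraction (equation 39) and its ratio-of-means period
    mean (equation 40); le and sh are (nt,) latent and sensible heat in W/m2."""
    ok = (le > 0) & (sh > 0) & (le + sh > phi)      # drop unstable winter values
    ef = np.where(ok, le / (le + sh), np.nan)       # equation 39, masked
    ef_mean = le[ok].mean() / (le[ok].mean() + sh[ok].mean())  # equation 40
    return ef, ef_mean
```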

3.1.2 Albedo

We compare the albedo derived from observational data products (Kato et al., 2013; König‐Langlo et al., 2013; Stackhouse et al., 2011) to that of models using the following expression:
$$\alpha(t,\mathbf{x}) = \frac{R_u(t,\mathbf{x})}{R_d(t,\mathbf{x})}, \quad (41)$$
where Ru and Rd are the upward and downward shortwave radiation, respectively. As with the evaporative fraction in section 3.1.1, the albedo expression can become numerically unstable when Rd approaches 0. Thus, we again apply a mask, ignoring regions where no significant incoming radiation is observed, Rd < δ. Equation 41 is used when comparing the RMSE and seasonal cycle. When the period mean and bias are computed, we compute the period mean albedo based on the ratio of averages,
$$\overline{\alpha}(\mathbf{x}) = \frac{\overline{R_u}(\mathbf{x})}{\overline{R_d}(\mathbf{x})}. \quad (42)$$

3.1.3 Global Net Ecosystem Carbon Balance

The observational data sets for the global net ecosystem carbon balance (Hoffman et al., 2014; Le Quéré et al., 2016) represent global totals, yet models return this value as fluxes defined over space. To create a model quantity commensurate with the observational data, ILAMB must integrate over the globe using equation 4. As the observational data set is a time series, much of our scoring methodology does not apply. For this discussion we will represent the global rate of carbon as nbp (Pg C/year). We compute the accumulation of nbp
$$\mathit{nbp}_{\mathrm{accum}}(t) = \int_{t_0}^{t} \mathit{nbp}(\tau)\, d\tau, \quad (43)$$
and score the difference in accumulated total at the end of the time period. The precise method differs slightly in each observational data set.
The Global Carbon Project (GCP) data set is derived by taking the land sink (uncertainty of ±0.8 Pg C/year) and subtracting the emissions from land use change (uncertainty of ±0.5 Pg C/year). This means that the total uncertainty of the accumulated nbp at the end of 2010, denoted here as $\delta_{\mathit{nbp}}$ (Pg C), is obtained by combining these uncertainties. We use this uncertainty to normalize the difference in accumulation at the end of the time period as a measure of relative error,
$$\epsilon_{\mathit{nbp}} = \frac{\left| \mathit{nbp}_{\mathrm{accum,mod}}(2010) - \mathit{nbp}_{\mathrm{accum,ref}}(2010) \right|}{\delta_{\mathit{nbp}}}, \quad (44)$$
and then again use equation 9 to compute a score for the difference,
$$s_{\mathit{nbp}} = e^{-\alpha_{\mathit{nbp}}\,\epsilon_{\mathit{nbp}}}, \quad (45)$$
where αnbp = 0.287 is chosen such that if a model falls within the certainty bounds of the accumulated amount through 2010, the corresponding score is at minimum 0.75. We see this as an important first step in incorporating uncertainty into the comparison methodology. We use the uncertainty to tune the scoring methodology, giving a good score to models that fall inside this uncertainty bound. We also compare the global rates of carbon across the time period in the form of a Taylor score of the time series, $S_{\mathrm{Taylor}}$, computed with equation 30, where the correlation and standard deviation are taken across the temporal dimension. Then the overall score is
$$S_{\mathit{nbp}} = \frac{s_{\mathit{nbp}} + S_{\mathrm{Taylor}}}{2}. \quad (46)$$
In the Hoffman et al. (2014) data set, we only score the accumulated amount at the end of the observed period. We omit providing a Taylor scoring of the rates because there appears to be some smoothing of the rate data inherent in the process of producing this data set. However, this data set explicitly provides a lower and upper bound on uncertainty as a function of time throughout the data set. So we determine the integrated uncertainty at the end of 2010 by accumulating the upper (52.4 Pg C) and lower (−32.1 Pg C) limit of uncertainty, computing the difference, and then halving the value resulting in an uncertainty of 42.3 (Pg C). We then use the same approach to score the difference,
$$\epsilon_{\mathit{nbp}} = \frac{\left| \mathit{nbp}_{\mathrm{accum,mod}}(2010) - \mathit{nbp}_{\mathrm{accum,ref}}(2010) \right|}{42.3}, \quad (47)$$
$$s_{\mathit{nbp}} = e^{-\alpha_{\mathit{nbp}}\,\epsilon_{\mathit{nbp}}}. \quad (48)$$
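A sketch of the accumulation-based scoring (equations 43, 47, and 48) against the Hoffman et al. (2014) uncertainty of 42.3 Pg C, assuming annual-mean global nbp rates; the names are illustrative.

```python
import numpy as np

def nbp_score(nbp_mod, nbp_ref, dt_years, uncertainty=42.3, alpha=0.287):
    """Score the accumulated net carbon balance against the reference
    (equations 43, 47-48); uncertainty is the half-width in Pg C."""
    accum_mod = np.sum(nbp_mod * dt_years)               # equation 43, discrete form
    accum_ref = np.sum(nbp_ref * dt_years)
    eps = np.abs(accum_mod - accum_ref) / uncertainty    # equation 47
    return np.exp(-alpha * eps)                          # equation 48
```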

3.1.4 Runoff

We use the Dai and Trenberth (2002) river discharge data set to assess model performance of runoff for the world's 50 largest river basins. First, we compute the mean annual runoff from the model over the time period of the observational data set. Then we take the river discharge data and distribute it over the area of the river basins and compare this to the mean runoff over the same basin. This simple approach was taken to allow us to compare runoff across models even if they do not have a river routing model.

We include plots of the mean runoff of the reference and model over river basins and the bias, represented in Figure 10. We also include regional mean runoff plots for each of the river basins included but only show that of the Amazon river basin in Figure 10d. The model performance is then scored using the bias (section 2.2.1), the interannual variability (section 2.2.4), and the spatial distribution (section 2.2.5) metrics.

Figure 10. Comparisons of runoff between the reference (Dai & Trenberth, 2002) and the model (CLM4.5) data sets. (a) Reference mean runoff; (b) model mean runoff; (c) mean runoff bias; (d) annual mean runoff for the Amazon river basin where the reference is shown in gray and the model in maroon.

3.1.5 Terrestrial Water Storage Anomaly

We use the Gravity Recovery and Climate Experiment (Swenson & Wahr, 2006) data set to assess the terrestrial water storage anomaly in models. However, there are a few challenges in producing a fair comparison. The first of those is that models report only the storage and so the anomaly must be computed. The more serious challenge is that the resolution of this data is quite coarse (300–400 km), and thus, pointwise comparisons are not appropriate (Swenson, 2013). Instead, we compare mean anomaly values over 30 of the world's largest river basins. In this way the comparison is more fair as it is over large areas and automatically omits dry areas, which are not of interest.

We include plots of the magnitude of the mean anomaly of the reference and model over river basins and the RMSE, represented in Figure 11. We also include regional mean anomaly plots for each of the river basins but show only that of the Amazon river basin in Figure 11d. The model performance is then scored using the RMSE (section 2.2.2) and the interannual variability (section 2.2.4) metrics.

Figure 11. Comparisons of the terrestrial water storage anomaly between the reference (Gravity Recovery and Climate Experiment) and the model (CLM4.5) data sets. (a) Reference mean anomaly magnitude; (b) model mean anomaly magnitude; (c) mean anomaly RMSE; (d) annual mean anomaly for the Amazon river basin where the reference is shown in gray and the model in maroon.

4 Software

We have implemented the methodology described in sections 2 and 3 into a software package that is freely available to the community. We previously developed a prototype implementation (Mu et al., 2015) based on the National Center for Atmospheric Research Command Language. We then moved the algorithm into an open source, openly developed python package (Collier et al., 2016) in an effort to produce a product to which the community can more easily make contributions. The referenced digital object identifier will lead to the software repository, where the source code and documentation can be found. The documentation includes the public interface as well as tutorials that span topics such as installation, basic usage, adding models or benchmark data sets, and formatting benchmark data sets.

The ILAMB package is designed to ingest data sets that follow the Climate and Forecast convention (Eaton et al., 2017). The Climate and Forecast website explains that the “conventions define metadata that provide a definitive description of what the data in each variable represents, and the spatial and temporal properties of the data. This enables users of data from different sources to decide which quantities are comparable, and facilitates building applications with powerful extraction, regridding, and display capabilities.” We have built the ILAMB package to embody this philosophy, making it directly useful to those who adhere to this standard. While model intercomparison efforts, such as CMIP5, have encouraged the use of these conventions among modelers, the observational community has not yet widely put them into practice. Much of the work in adding data sets to the collection is in encoding them to follow this convention.

For the purpose of communicating how the ILAMB package works, consider the configure file shown in Figure 12, which defines a set of observational data sets that will be used to confront models. The h1 bracket is a heading used to categorize variables, each of which is represented by an h2 heading. This comparison involves the surface upward shortwave radiation and the albedo, both variables belonging to the radiation and energy cycle. Inside each h2 heading, we specify the variable name that will be compared (rsus is the netCDF variable name for surface upward shortwave radiation). We also provide a mechanism for variable synonyms by allowing alternate variable names to be specified; if the ILAMB system cannot find the main variable, it will try to find any alternates that the user specifies. This allows the software to encourage the use of standard variable names while accommodating modeling groups that want to use ILAMB without preprocessing. Also note the derived keyword in the albedo section. While the components of albedo are part of standard model output, the albedo itself is not. The ILAMB package therefore allows users to specify algebraic relationships directly in the configure file, which makes the derivation automatic and transparent to anyone reading the file.

Figure 12. Sample International Land Model Benchmarking configure file defining comparisons to the surface upward shortwave radiation and albedo variables from the CERES (Kato et al., 2013) product.
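The contents of Figure 12 are not reproduced here, but a configure file with the structure described above might look roughly like the following sketch; the heading names, keys, and file paths are illustrative rather than a verbatim copy of the figure:

    [h1: Radiation and Energy Cycle]

    [h2: Surface Upward SW Radiation]
    variable       = "rsus"
    alternate_vars = "FSR"
    [CERES]
    source         = "DATA/rsus/CERES/rsus_0.5x0.5.nc"

    [h2: Albedo]
    variable = "albedo"
    derived  = "rsus/rsds"
    [CERES]
    source   = "DATA/albedo/CERES/albedo_0.5x0.5.nc"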

The ILAMB package will ingest this configure file and try to build commensurate quantities from model outputs. While observational data sets come in different forms (globally gridded remote sensing products, tower data collections, etc.), the ILAMB system reads the spatial and temporal information found in the file and uses it to trim, subsample, and/or coarsen the model data as appropriate.
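As a conceptual illustration of this step (ILAMB's internal routines are more general, and the file names here are placeholders), trimming a model record to the observational period and placing it on a coarser observational grid might look like the following in Python with xarray:

    import xarray as xr

    obs = xr.open_dataset("rsus_placeholder.nc")        # reference data set
    mod = xr.open_dataset("model_rsus_monthly.nc")      # model output

    # Restrict the model record to the time span of the observations
    mod = mod.sel(time=slice(obs.time.values[0], obs.time.values[-1]))

    # Interpolate the model field onto the (coarser) observational grid
    mod_on_obs_grid = mod["rsus"].interp(lat=obs["lat"], lon=obs["lon"])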

5 Discussion

The ILAMB framework is designed to be both powerful and flexible. While we have made choices in the default configuration, described above, focused on global analysis for decadal to centennial scale ESMs, ILAMB allows the user to customize the selection of variables, the weighting of data sets, and the spatial subsetting, making it useful for assessing results from mesoscale weather forecasting or other models. We envision developing a library of sample configuration files, targeting various well‐known models and model applications.

As much of the usefulness of ILAMB depends on the quality of the underlying observational data, we recommend that data providers include explicit representations of the underlying spatial grids including the areas over which quantities have been averaged. Observational data sets frequently report mean values in a cell taken over an area which may include land but also portions of lakes, rivers, and oceans. This leads to ambiguity with regard to the contribution of land cover types to the measurement itself and subsequently adds to the uncertainty when comparing values to model output.

5.1 Interpreting the Overall Score

The thrust of this paper is to detail a methodology for computing a single overall score that captures a model's skill in reproducing patterns found in the observed record. However, we do not view the absolute value of the score as particularly meaningful beyond the precise definition described in this paper. In general, no model can achieve a perfect score for any given variable for several reasons.

First, there is measurement error and uncertainty in the observational data that makes a perfect score against even a single data set unlikely or undesirable. This is what motivates some in the community (Abramowitz, 2005; Best et al., 2015) to posit that benchmarking requires an expectation of performance, which is admittedly lacking in our approach. Second, although every attempt is made to employ multiple independent data sets of high quality for confrontation with models, these data sets are inconsistent with one another, making a perfect score across all data sets impossible. We nevertheless include comparisons with multiple observational and synthesized data sets for a single variable to offer the user more information about the robustness of model predictions within the limits of observational uncertainty at varying spatial and temporal scales. Third, a lower score with respect to a given variable is not necessarily a sign of a poor model. It may rather highlight the need for better measurement campaigns or improved metrics (i.e., sometimes we learn that our measurements are incomplete or do not acknowledge important uncertainties, or that our metrics are inappropriate for a given data set).

The overall score is meant to aid the scientist in discovering when meaningful changes have occurred in the model or across models. The holistic nature of the ILAMB suite of data sets and metrics helps provide a synthesis of model performance that directs the attention of the user to relevant aspects. While we present Figure 1 as the main result of the ILAMB methodology, it is intended to merely indicate variables of particular interest for further consideration. ILAMB output is presented as a hierarchy of interactive web pages that employ JavaScript features to present information to users in a logical and intuitive fashion. From the graphical overview, the user can select individual variables and data sets from the Results Table tab to be led to pages that detail the contributing factors to the model's overall score. On this new page, predefined spatial regions can be individually selected, causing the tabular data and diagnostics to be updated automatically to reflect information relevant only to that region. Although all the tabular information, scores, and graphical diagnostics are precomputed and generated when ILAMB is run, the web‐based interface is designed to facilitate discovery and understanding of model results. The overall score does not replace the scientist; it guides her or him to the relevant plots and diagnostics.

5.2 How Is ILAMB Used?

The ILAMB package is particularly useful for verification, that is, during model development to confirm that new model code improves performance in a targeted area without degrading performance in another area, and for validation, that is, when comparing performance of one model or model version to that of other models or model versions.

In developing and applying the ILAMB package, we have incorporated a wide variety of representative observational data sets (see Tables 2-5), and we have favored data that have the most open data policies. In many cases, these data have been averaged or remapped to be more directly comparable with model output. As this collection of data sets grows, maintaining and distributing the latest versions will be challenging and require community collaboration. For tracking the evolving performance of models over the long term, it may be necessary to maintain access to older versions of data as well as the latest version since corrections to observational data sets can significantly impact model performance scores. Various technologies could fill this role, and the Observations for Climate Model Intercomparisons (obs4MIPs; https://www.earthsystemcog.org/projects/obs4mips/) activity shows promise as a potential solution to this challenge. The preferred solution would ideally support versioning and allow for long‐lived versions associated with ILAMB releases. In the interim, we have implemented a simple scheme for sharing summarized and remapped data sets through a web server.

The ILAMB package is currently being used by individual model developers and international modeling centers. ILAMB offers developers a quick and easy method for checking the impacts of new model development before committing code changes. For modeling centers, ILAMB provides a systematic assessment of historical simulation experiments and enables tracking of performance of model revisions. ILAMB will also be useful for MIPs as a starting point for evaluating model variability and uncertainty. As a part of such MIPs, investigators may wish to develop custom metrics or incorporate data sets specific to their purposes. ILAMB could be executed automatically as model results are uploaded to a system like the Earth System Grid Federation (https://esgf.llnl.gov/) to give users a first look at variation in results and to determine if output should be downloaded for a particular study. ILAMB diagnostics can also be useful for parameter sensitivity studies or for optimization experiments in combination with an automated modeling framework like the Predictive Ecosystem Analyzer (http://pecanproject.org/; Dietze et al., 2014; LeBauer et al., 2013). For the assessments community, the results of a multimodel ILAMB evaluation could be useful for understanding which model results would be appropriate for use in studying impacts and which models may poorly capture processes relevant to the impacts under consideration.

5.3 Future Work

Development of the ILAMB package is ongoing, and the terrestrial modeling and observational communities are being engaged to identify in situ and remote sensing data sets, to define additional evaluation metrics, and to use the package for a wide variety of MIPs (Hoffman et al., 2017). While most effort has been invested in global‐ and regional‐scale model evaluation, new work is focused on improved benchmarking for site level time series, spatial transects, and seasonal and diurnal variability. Future development will include incorporation of experiment‐specific model evaluation metrics derived from prior studies, including Free‐Air CO2 Enrichment (Walker et al., 2014, 2015; Zaehle et al., 2014), nutrient addition, rainfall exclusion, and warming experiments (Bouskill et al., 2014; Zhu et al., 2016). Partner activities, like NASA's Permafrost Benchmarking System project and the Arctic‐Boreal Vulnerability Experiment, are integrating additional data sets and building metrics for specific regions, study areas, or processes of interest. We are applying the ILAMB methodology and code base to develop a marine biogeochemical model benchmarking tool, called the International Ocean Model Benchmarking package.

Based on previous prototypes and community discussion, we developed the ILAMB model benchmarking package for evaluating the fidelity of land carbon cycle models. The package generates graphical diagnostics and computes a comprehensive set of statistics through model‐data comparisons and scores model performance for a wide variety of variables for a suite of observational data sets. Rigorously defined model evaluation metrics and strategies for handling multiple resolutions and land masks are documented above. The ILAMB package is open source and is becoming widely adopted by modeling centers and for informing model intercomparison studies. We are actively seeking community involvement in adding more evaluation metrics and new observational data sets.

Acknowledgments

This manuscript has been authored by UT‐Battelle, LLC under contract DE‐AC05‐00OR22725 with the U.S. Department of Energy. The U.S. Government retains and the publisher, by accepting the article for publication, acknowledges that the U.S. Government retains a nonexclusive, paid‐up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for U.S. Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). This research was supported through the Reducing Uncertainties in Biogeochemical Interactions through Synthesis and Computation Scientific Focus Area (RUBISCO SFA), which is sponsored by the Regional and Global Climate Modeling (RGCM) Program in the Climate and Environmental Sciences Division (CESD) of the Office of Biological and Environmental Research (BER) in the U.S. Department of Energy Office of Science. Oak Ridge National Laboratory (ORNL) is managed by UT‐Battelle, LLC for the U.S. Department of Energy under contract DE‐AC05‐00OR22725. The National Center for Atmospheric Research (NCAR) is managed by the University Corporation for Atmospheric Research (UCAR) on behalf of the National Science Foundation (NSF). Lawrence Berkeley National Laboratory (LBNL) is managed and operated by the Regents of the University of California under contract DE‐AC02‐05CH11231.
