Full Implementation of Matrix Approach to Biogeochemistry Module of CLM5

Earth system models (ESMs) have been rapidly developed in recent decades to advance our understanding of climate change‐carbon cycle feedback. However, those models are massive in coding, require expensive computational resources, and have difficulty in diagnosing their performance. It is highly desirable to develop ESMs with modularity and effective diagnostics. Toward these goals, we implemented a matrix approach to the Community Land Model version 5 (CLM5) to represent carbon and nitrogen cycles. Specifically, we reorganized 18 balance equations each for carbon and nitrogen cycles among the 18 vegetation pools in the original CLM5 into two matrix equations. Similarly, 140 balance equations each for carbon and nitrogen cycles among the 140 soil pools were reorganized into two additional matrix equations. The vegetation carbon and nitrogen matrix equations are connected to soil matrix equations via litterfall. The matrix equations fully reproduce simulations of carbon and nitrogen dynamics by the original model. The computational cost for forwarding simulation of the CLM5 matrix model was 26% more expensive than the original model, largely due to calculation of additional diagnostic variables, but the spin‐up computational cost was significantly saved. We showed a case study on modeled soil carbon storage under two forcing data sets to illustrate the diagnostic capability that the matrix approach uniquely offers to understand simulation results of global carbon and nitrogen dynamics. The successful implementation of the matrix approach to CLM5, one of the most complex land models, demonstrates that most, if not all, the biogeochemical models can be reorganized into the matrix form to gain high modularity, effective diagnostics, and accelerated spin‐up.


Introduction
Land biogeochemistry models are an essential component of Earth system models (ESMs) and have been extensively used to study climate change and its impacts on societally relevant commodities such as food, timber, energy, and crops (Bonan & Doney, 2018). Results of biogeochemistry models have been used to guide climate change mitigation (Eyring et al., 2016;Friedlingstein, Andrew, et al., 2014;Jones et al., 2013;Masson-Delmotte et al., 2018;Meinshausen et al., 2009;Stocker et al., 2013), estimate global carbon budget (Friedlingstein et al., 2019;Houghton et al., 2012;Le Quéré et al., 2013;Schaphoff et al., 2013;Song & Woodcock, 2003), and assess climate change-carbon cycle feedback Friedlingstein, 2015;Friedlingstein et al., 2006;Friedlingstein, Meinshausen, et al., 2014;Wenzel et al., 2014). On one hand, the computational requirement is much higher than before as land models have become more complex over time. On the other hand, as land models have become popular worldwide and versatile for exploring a variety of issues, their users are more diverse with different backgrounds, skill sets, and abilities. It is highly desirable to develop land models with simplicity in coding, modularization, effective diagnostics, and high computational efficiency so that scientists can use and further develop the models with ease in understanding, evaluation, and improvement.
An epitome of the land model development is the Community Land Model (CLM). The initial version of CLM was established by merging National Center for Atmosphere Research (NCAR) Land Surface Model (LSM) (Bonan, 1996), Institute of Atmosphere Physics (IAP), Chinese Academy of Sciences land model (IAP94) (Dai & Zeng, 1997), and the Biosphere-Atmosphere Transfer Scheme (BATS) (Dickinson et al., 1993). The early versions mainly focused on biophysical processes of the land surface. CLM has been continually developed from version 2 (CLM2) in 2002 to version 5 (CLM5) in 2018, including changes to model structure as well as the addition of process representation. In particular, CLM3.5 for the first time included the carbon cycle and CLM4 represented the coupled carbon-nitrogen cycle based on the model Biome-BGC (BioGeochemical Cycles) (Running & Hunt, 1993;Thornton et al., 2002) and the land use change. CLM4.5 incorporated a vertically resolved soil biogeochemistry scheme (Koven et al., 2013), fire model (Li et al., 2012(Li et al., , 2013, methane model (Riley et al., 2011), crop model (Drewniak et al., 2013;Sacks et al., 2009), and carbon isotope model (Koven et al., 2013). CLM5 further updated the biogeochemical modules, such as plant nutrient dynamics using the Fixation and Update of Nitrogen (FUN) model (Fisher et al., 2010), and photosynthetic capacity using Leaf Use of Nitrogen for Assimilation (LUNA) model (Xu et al., 2012). The development of land surface models has met the increasing need from the users and, meanwhile, makes models very complex. The lines of the source code were 32,651 in CLM2 and increased to over 150,000 in CLM5, of which nearly 50,000 lines are used for the biogeochemical module. The increasing complexity of CLM means that the development enterprise would benefit from greater modularization and methods that help diagnose model results. In addition, the long spin-up for soil biogeochemistry hampers uses of the model by a wider range of researchers. Luo et al. (2017) proposed a matrix approach to land biogeochemistry modeling. The approach reorganizes carbon balance equations in the original models into one matrix equation without changing any modeled carbon cycle processes and mechanisms. The matrix approach was initially implemented in the Terrestrial ECOsystem model (TECO) for data assimilation studies (Luo et al., 2003;Xu et al., 2006). The matrix approach has recently been applied to global land models, such as the Community Atmosphere-Biosphere-Land Exchange model (CABLE) (Xia et al., 2012(Xia et al., , 2013, LPJ-GUESS (Ahlstrom et al., 2015), ORCHIDEE (Huang, Zhu, et al., 2018), CLM3.5 (Hararuk et al., 2015;Hararuk & Luo, 2014;Rafique et al., 2016), CLM4 (Rafique et al., 2017), CLM4.5 (Huang, Lu, et al., 2018), and CLM5 in this study. The CABLE matrix model was used to accelerate its spin-up (Xia et al., 2012) and for traceability analysis to trace sources of uncertainty in model simulations of terrestrial carbon storage (Xia et al., 2013). The LPJ-GUESS matrix model was used to analyze the uncertainty of future climate-induced C uptake (Ahlstrom et al., 2015). Matrix versions of CLM4, CASA, and CABLE were compared to assess carbon dynamics using the traceability framework (Rafique et al., 2017). Overall, the matrix approach offers simplicity in coding, modularization, effective diagnostics, and high computational efficiency for spin-up. The latter makes it possible to conduct complex parameter sensitivity analysis (Huang, Lu, et al., 2018) and data assimilation (Hararuk & Luo, 2014;Shi et al., 2018;Tao et al., 2020). This paper describes the full implementation of the matrix approach to CLM5. It is built upon the study by Huang, Lu, et al. (2018) who reorganized soil carbon balance equations in CLM4.5 into a stand-alone matrix model. This study reorganizes mass balance equations of carbon and nitrogen cycles in both vegetation and soil into four matrix equations and, more importantly, makes the matrix model executable with CLM5 and Community Earth System Model version 2 (CESM2). We first describe the mathematic details of the matrix model and numerical calculation using a sparse matrix technique for forward modeling. Then, we verify the CLM5 matrix model to exactly reproduce modeled global carbon and nitrogen dynamics in the original model. We use a technical test package developed by National Center for Atmospheric Research (NCAR) to ensure that the matrix carbon and nitrogen modules 100% meet the technical standards set by and can be compiled and executed with, CLM5 and the Community Earth System Model version 2 (CESM2). We also evaluate and compare the computational cost of CLM5 with and without the matrix model. In addition, we show the diagnostic capability that the matrix model uniquely offers to help understand and interpret simulation results of global carbon dynamics.

Biogeochemical Modules of the Original CLM5
The CLM5 biogeochemistry module includes carbon cycle and nitrogen cycle for aboveground and belowground processes . The vegetation biogeochemical cycle in CLM5 contains 18 carbon pools, including six tissue pools: leaf, fine root, live stem, dead stem, live coarse root, and dead coarse root ( Figure 1 and Table 1). Each tissue pool is accompanied by a storage pool and a transfer pool. The vegetation nitrogen cycle includes one additional pool, a retranslocation pool to store reusable

10.1029/2020MS002105
Journal of Advances in Modeling Earth Systems nitrogen from litter fall. A crop grain tissue pool, accompanied by a grain storage pool and a grain transfer pool, is added when the crop model is used. The belowground module of CLM5 has 20 soil layers by a default setting. Each layer contains seven pools for organic carbon and organic nitrogen, respectively, in metabolic litter, cellulose litter, lignin litter, coarse woody debris, fast soil organic matter, slow soil organic matter, and passive soil organic matter (Table 2). Thus, there are a total of 140 soil organic carbon pools (7 pools × 20 layers) and 140 soil organic nitrogen pools. Inorganic nitrogen pools, such as ammoniacal nitrogen and nitrate nitrogen pools, and related processes such as leaching, nitrification, denitrification, atmospheric nitrogen decomposition, and biological nitrogen fixation have all been preserved in original CLM5 biogeochemistry cycle modules but not reorganized into the matrix equation for soil nitrogen cycle.
Changes in vegetation carbon and nitrogen pool sizes are controlled by many processes. Photosynthesis, plant respiration, and plant nitrogen uptake, for example, control the net carbon and nitrogen input into vegetation carbon and nitrogen pools. Net carbon and nitrogen input are allocated to different storage pools and tissue pools of vegetation. Reusable nitrogen before litter falling is first stored in retranslocation pool, and then transferred to storage pools or display tissue pools for plant growth.
Vegetation carbon and nitrogen dynamics are also controlled by phenology for deciduous plant functional types (PFTs) while evergreen PFTs have constant leaf turnover rates. Plant phenology determines the leaf onset and offset dates according to air temperature and soil water conditions. During the onset period, plant carbon is transferred from storage pools through transfer pools to tissue pools within a half month, so that leaf area index grows rapidly. In the offset period, leaf and fine root carbon is lost to litterfall and leaf area index drops quickly. In addition, harvest, natural death, and fire alter the vegetation biogeochemical cycles. Harvest is included when land use change occurs according to a scheme of Land Use Harmonized version 2 (LUH2). LUH2 was used in Land Use Model Intercomparison Project (LUMIP; http://cmip.ucar.edu/lumip) as part of Coupled Model Intercomparison Project 6 (CMIP6). Part of plant carbon is removed from ecosystems and part goes to litter pools according to the harvest rate as a function of the PFT transition area. Conservation in mass, energy, carbon, and nitrogen has been considered to reconcile the change in area. Natural death occurs as a result of aging and is applicable to all plant functional types. Carbon moves from plant pools litter pools at defined mortality rates. The fire module triggers the occurrence of occasional fire events based on the amount of the fuel (i.e., litter) and the soil moisture. When fire occurs, carbon and  Passive soil organic matter nitrogen in plant pools are partly released to the atmosphere and partly transferred to the litter pools based on their tissue quality and burned area.
Following the vegetation biogeochemical cycles, soil carbon, and nitrogen cycles are controlled by various soil processes in CLM5. Carbon and nitrogen transfers in soil biogeochemical cycles are related to soil and litter decomposition, vertical mixing, mineralization, immobilization, atmospheric nitrogen deposition, biological nitrogen fixation, nitrification, denitrification, leaching, and fire.
In this study, we focus on reorganization of the organic carbon and nitrogen balance equations into a matrix form. As a consequence, code structure and the update of state variables within each time step are changed. All the other processes, such as decomposition, mineralization, immobilization, fire, nitrogen deposition, biological fixation, nitrification, and denitrification, remain intact as in the original code without any changes in this study.

Matrix Representation of the Vegetation Carbon and Nitrogen Cycles
In the new matrix representation, we reorganize CLM5 vegetation carbon and nitrogen balance equations into two matrix equations: C veg and N veg are two time-dependent state variables, which are n entry vectors, each representing its respective vegetation pool size (g C m −2 and g N m −2 ). I Cin and I Nin are scalars for carbon and nitrogen input, respectively. Carbon input is the net primary productivity, which is the difference between gross primary productivity and autotrophic respiration. Nitrogen input for vegetation nitrogen cycle includes both nitrogen fixation and uptake. B is also a n entry vector, representing allocation fraction of plant carbon or nitrogen input to individual pools. K is an n × n diagonal matrix. Its subscripts ph, gm, and fi indicate phenology, gap mortality (i.e., harvest from land use and natural mortality), and fire processes, respectively. The diagonal entries are the exit rates of all vegetation pools due to phenology (K ph ), gap mortality (K gm ), and fire (K fi ), respectively, as described by The exit rates in plant phenology matrix K indicate the leaf, root, live stem, and dead stem turnover due to phenology processes. The exit rates in gap mortality matrix K include harvest rates from land use plus natural mortality. The exit rates in fire matrix K represent the plant carbon loss rate due to the fire.
A is a transfer coefficient matrix, representing carbon and nitrogen transfer among pools as specified in (Equations 6-11) below for CLM5. Subscripts c and n are for carbon and nitrogen. 10.1029/2020MS002105

Journal of Advances in Modeling Earth Systems
10.1029/2020MS002105

Journal of Advances in Modeling Earth Systems
The off-diagonal entry, a i, j , for matrix A represents a fraction of carbon or nitrogen leaving pool j that goes to pool i. The diagonal entry is set to −1 to represent that all the exiting carbon leaves pool j. The pool names referred by the subscript i or j can be found in Table 1.
Interactions between vegetation carbon and nitrogen cycles in the original CLM5 are fully preserved in CLM5 matrix model. The original CLM5 has two modules to regulate carbon and nitrogen interactions for vegetation. First, the photosynthetic capacity, an important variable driving the carbon cycle, interacts with nitrogen cycle in the LUNA module. LUNA optimizes nitrogen allocation to maximize the daily net photosynthetic carbon gain. Second, plant nitrogen uptake interacts with the carbon assimilation in the FUN module. FUN is based on the concept that nitrogen uptake requires the expenditure of energy in the form of exchange for carbon. Both the modules are fully preserved to drive the carbon and nitrogen interactions in the CLM5 vegetation matrix model.

Matrix Representation of CLM5 Soil Carbon and Nitrogen Cycles
The matrix representation of CLM5 soil biogeochemical cycle follows the previous work in CLM4.5 (Huang, Lu, et al., 2018). The new implementation in this study extends the maximum soil layers from a fixed value of 10 in CLM4.5 to a default value of 20 with flexibility to change in CLM5. In addition, we convert nitrogen balance equations in the original CLM5 to a matrix representation. The soil organic carbon and nitrogen transfer among soil pools is formulated by following matrix equations: As with vegetation, C soil and N soil are two vectors of state variables representing soil organic carbon and nitrogen pool sizes in g C m −3 and g N m −3 , respectively. I Csoil and I Nsoil are vectors representing plant litterfall into different litter carbon and nitrogen pools, respectively. A hc and A hn represent the horizontal transfers of carbon and nitrogen, respectively, which means transfers among pools within one soil layer. V stands for the rate of vertical mixing between the same types of pools across soil layers, which is the same for carbon and nitrogen. K f is the rate of fire-induced litter loss, which is the same for carbon and nitrogen.
Matrix A h , including both A hc and A hn , is a block matrix constituted by several matrices as Each of the matrices A ij within the block matrix A hc or A hn is 20 by 20, corresponding to 20 soil layers. These nonzero, off-diagonal matrices A ij , i ≠ j, indicate carbon and nitrogen transfer among pools within one soil layer as 10.1029/2020MS002105

Journal of Advances in Modeling Earth Systems
Each of the diagonal matrices A ii is a negative identical matrix. The off-diagonal matrices A ij have non- The parameter r ij is the respired fraction of carbon along the transfer pathway from pool j to i. T ij represents a pathway fraction of carbon going to pool i from that leaving the jth pool due to decomposition. CN j, k represents carbon and nitrogen ratio of pool j in layer k.
The diagonal matrices K h and K f indicate the turnover rates, respectively, due to decomposition (horizontal transfer) and fire, at different layers: Environmental scalars ξ are time-dependent variables and are the product of temperature scalar ξ T , water scalar ξ W , oxygen scalar ξ O , depth scalar ξ D , and nitrogen scalar ξ N : The vertical mixing coefficient matrix V is made up of six identical matrices v: Note that vertical mixing of coarse woody debris (CWD) is not allowed in CLM5; therefore, the corresponding vertical mixing matrix is 0 for CWD. The matrix v is a tridiagonal matrix, indicating the vertical mixing only transfers between adjacent layers: The subscripts represent the soil layer. g and h are vertical mixing rates related to upward and downward transfers.
As the same for the vegetation part, interactions between soil carbon and nitrogen cycles in the original CLM5 are fully preserved in the CLM5 soil matrix model. In the original CLM5, nitrogen limits soil organic carbon decomposition. The nitrogen limitation is represented by the ratio between available mineral nitrogen and the total soil nitrogen demand. The soil nitrogen demand includes soil immobilization during soil decomposition and plant uptake demand. The dynamics of mineral nitrogen processes that are involved in carbon and nitrogen interactions can be represented by one equation as done by Shi et al. (2016). The equation on mineral nitrogen dynamics can be coupled with the matrix equations on organic carbon and nitrogen processes to analytically or semianalytically explore their interactions.

Calculation of Storage Capacity and Storage Potential
One benefit of the matrix approach is expanded diagnostic capability. The transient dynamics of ecosystem carbon and nitrogen are interpretable from storage capacity and storage potential (Huang, Lu, et al., 2018;Jiang et al., 2017;Luo et al., 2017;Zhou et al., 2018). Theoretically, the storage capacity and potential can be calculated at any time point . For the purpose of diagnostics, we only calculate the annual average of vegetation storage capacity for both carbon and nitrogen from annual cumulative carbon and nitrogen inputs (I Cveg,ann and I Nveg,ann ) and yearly transfer coefficients (Π ann, veg, C and Π ann, veg I Cveg, ann and I Nveg, ann are vectors to allocate carbon and nitrogen uptake to different vegetation pools. The units are g C m −2 year −1 and g N m −2 year −1 , respectively. Π ann, veg, C for vegetation carbon and Π ann, veg, N for vegetation nitrogen are matrices of annual transfer coefficients representing annual transfer rate (in unit of year −1 ) from one pool to another (nondiagonal entries) or exit rate leaving one pool (diagonal entries) as where C soil, d (t) and N soil, d (t) are diagonal matrices of soil carbon and soil nitrogen.
are inverse diagonal matrices of soil carbon and soil nitrogen.
Based on Luo et al. (2017), the storage potential is the storage capacity minus the storage, representing the amount of additional C that an ecosystem can store to reach the storage capacity

Code Structure and Update of State Variables
Implementation of the matrix model helps modularize the code to more clearly describe the complex carbon and nitrogen transfer network in CLM5. The original code calculates the dynamics of carbon and nitrogen individually from different processes, such as phenology, gap mortality (i.e., plant death), harvest, fire, soil decomposition, and vertical mix in six subroutines. Updates of state variables are carried out stepwise in eight subroutines (Figure 2). With the matrix formulation, entries in matrix Equations 1, 2, 12, and 13 are taken from subroutines of the original CLM5. For example, all the entries for term A phc (t) of Equation 1 (as expressed in Equation 6) are calculated in the phenology subroutine and passed to the vegetation matrix module. Once all the entries from the different subroutines are calculated and passed to the vegetation and soil matrix modules, the matrix model updates the carbon and nitrogen state variables once per time step (Figure 2). The matrix model then calculates new diagnostic variables such as those described in section 2.4.
A namelist switch is implemented, which allows users to use either the matrix code or the original code. When the matrix code is on, state variables update once per time step in the matrix modules. When the matrix code is off, state variables update three times, respectively, after phenology and soil decomposition, gap mortality and harvest, and fire, stepwise per time step in the original CLM5 modules.

Sparse Matrix
Since the matrix equations include all the carbon and nitrogen pools of plants and soil layers, the plant transfer matrix is at least 18 by 18 and the soil transfer matrix is 140 by 140 for carbon, with similarly sized matrices for nitrogen. The solution of these large matrix equations is computationally expensive. However, the number of nonzero entries in A hc /A hn , K h , V, and K f of the soil matrix equations account for only 3.0%, 0.2%, 3.2%, and 0.2% of all the entries, and therefore these matrices can be considered sparse matrices. Since the time spent to access a matrix variable increases exponentially with the matrix size, the computational cost to operate the soil matrix in 140 by 140 is 300% more than that for the original CLM5. This is unacceptable for the code structure development. To improve the computation efficiency for CLM5 matrix modules, we use a sparse matrix solution method.

Sparse Matrix Format
We use a coordinate compact format to store the sparse matrix. The format records the rows, columns, and values of nonzero entries in three vectors with the same order. The sparse matrix formats use smaller memory for soil matrix equations (A hc /A hn , K h , V, and K f memory reduced by 91%, 99.4%, 90.4%, and 99.4%, respectively). The nonzero entries are recorded in a matrix from top to bottom and then from left to right 10.1029/2020MS002105

Journal of Advances in Modeling Earth Systems
columns. These coordinates of all the sparse matrices in CLM5 are shown in Tables S1-S4 in the supporting information.
In Equations 1, 2, 12, and 13, we save A phc , A phn , A gmc , A gmn , A fic , A fin , A hc , A hn , V, and K f in the coordinate compact format. According to the format, the value of the ith nonzero entry in a sparse matrix A can be uniquely identified. Therefore, we notate them as A(i). The row and column number of the ith nonzero entry can be also notated as Row A (i) and Col A (i). A(i), Row A (i), and Col A (i) together uniquely define a sparse matrix. In those four equations, there are also seven diagonal matrices, K phc , K phn , K gmc , K gmn , K fic , K fin , and K h . Those diagonal matrices K are converted to vectors using the diagonal vector format. The diagonal entry in row i can be notated as K(i).

Matrix Operation in the Sparse Matrix Format
Generally, operation of matrix Equations 1, 2, 12, and 13 includes multiplication and addition. The matrix multiplication occurs between (1) coordinate compact format and diagonal vector format and (2) coordinate compact format and vector format.
Sparse matrix in coordinate compact format is multiplied by diagonal matrix in diagonal vector format, such as matrix A with matrix K in Equations 1, 2, 12, and 13. Since K is a diagonal matrix, the locations of nonzero entries in the production AK matrix should be the same with A. Thus, we have where AK(i) represents the ith nonzero entry in the production AK, Col AK (i) is the column coordinate of the ith nonzero entry in the sparse matrix AK, and Row AK (i) is the row coordinate of the ith nonzero entry in the sparse matrix AK.
When matrix A multiply vector C in Equations 1, 2, 12, and 13, the production AC is also a vector. Thus, we have where AC(i) and C(i) represent the ith entry in the vector AC and C. n is the size of the matrix A, which also equals the number of pools, for example, n = 140 for soil carbon cycle module in CLM5.
The matrix addition is carried out for (1) two coordinate compact formats when there is no fire and (2) three coordinate compact formats when fire is on. Sparse matrix addition operation is more complicated than sparse matrix multiplication, since the location of the nonzero entries in addition is not easily predicted. If the locations of nonzero entries in to-be-added matrices (e.g., A phc K phc , and A gmc K gmc ) does not change for any specific addition (e.g., A phc K phc + A gmc K gmc ), the addition should follow a constant mapping function. This is because the coordinates order in addition matrix should be exclusive in the coordinate matrix format. Therefore, we save the constant mapping function to avoid repeated calculation.
Two sparse matrices addition are used in coordinate compact format for the second term in Equations 1, 2, 12, and 13 when there is no fire. Both the to-be-added matrices (A and B) and addition matrix (A&B) are sparse matrices in coordinate compact format. The nonzero entries in A&B should be no less than matrix A and B. The ith nonzero entry in A&B can be contributed from A, B or both A and B. The mapping function j = I A → AB (i) records the mapping relation between jth nonzero entry in matrix A and the ith nonzero entry in matrix A&B. Similarly, k = I B → AB (i) records the mapping relation between kth nonzero entry in matrix B and the ith nonzero entry in matrix A&B: Three sparse matrices addition is done for the second term in Equations 1, 2, 12, and 13 when fire is on. All the denotations are similar with Equation 37. The addition is also based on the mapping function I A → ABC , I B → ABC , and I C → ABC from added matrices (A, B, and C) to addition matrix (A&B&C):

Validation and Efficiency Test of the CLM5 Matrix Model
To validate the matrix model, we run global historical (1850-2014) simulations with rising atmospheric CO 2 and warming climate by two versions of CLM5 with the matrix or the original modules at a 4°× 5°resolution to compare their results. The CLM5 model simulations are conducted on Cheyenne system (2019). Simulations are driven by reanalysis meteorological forcing, the Global Soil Wetness Project Phase 3 data set (GSWP3) (Dirmeyer et al., 2006) ranged from 1901 to 2014. Meteorological forcing from 1901 to 1920 are recursively used in simulation period from 1850 to 1900. The atmospheric CO 2 concentration forcing uses the prescribed data from Climate Model Intercomparison Project Phase 6 (CMIP6) historical scenario (Eyring et al., 2016;Meinshausen et al., 2017 To test the efficiency of the matrix model, we use the same spatial resolution and meteorological forcing as for the validation experiment. But we only run the model for one month for the efficiency test. Comparison of the efficiency are made by tracking the elapsed time in each subroutine.
The traceability analysis Xia et al., 2013) assumes that the soil carbon storage equals to soil carbon storage capacity at steady state. The soil carbon storage capacity is expressed by Equation 39 as More complex than the traceability analysis for CABLE by Xia et al. (2013) is that CLM5 introduced the vertical transfer matrix V(t) and fire matrix K f (t). The residence time as expressed in Equation 40 can be further decomposed into Following the terminology by Xia et al. (2013), is the environmental scalar. I is an identity matrix. However, I + (A hc ξ(t)K h ) −1 (V(t) + K f (t)) is hard to be further decomposed and named as a process-related scalar, ξ −1 other t ð Þ. The closer ξ −1 other t ð Þ is to 1, the more dominant the soil decomposition process is. Then, the matrix Equation 41 can be simplified to one dimensional equation: |τ|is the total soil carbon residence time, which is the products of three terms: averaged inverse of processrelated scalars ξ −1 other t ð Þ, averaged inverse of environmental scalars ξ −1 t ð Þ, and baseline residence time |τ base |. represents the ratio of the traceable components between two forcing sets. X i,GSWP and X i,CRU are traceable components from steady state simulations with GSWP and CRUNCEP-v7, respectively. In the global or regional analysis, each traceable component is represented by the spatial average X. Details in the calculation can be found in Text S1.

Global Validation of the Matrix Model to Carbon and Nitrogen Simulations
The temporal dynamics of carbon and nitrogen storage simulated by the matrix modules was compared with the dynamics by the original modules. The CLM5 matrix model and the original CLM5 use the same default initial value without spin-up for the transition simulation. Modeled carbon and nitrogen storage in total ecosystem, vegetation, and soil organic matter from the matrix model matched with those from the original CLM5 (Figure 3). The relative differences of soil carbon and nitrogen are less than 1% for most grid cells. Only 0.4% and 0.5% of the grid cells in vegetation carbon and nitrogen storage, respectively, diverged by more than 1% relative difference (Figures 3g and 3h). Only 2 of 1,466 global grid cells in ecosystem carbon and nitrogen storage diverged by more than 1%. Global annual means in all six carbon and nitrogen storages over 115 years perfectly lined on the 1:1 line (Figures 3a-3f ). The good agreements demonstrate that the matrix model can reproduce both temporal dynamics and spatial patterns of carbon and nitrogen states from the original model.
It was worth repeating that the matrix modules update carbon and nitrogen state variables only once within each time step, whereas the original modules updated these state variables stepwise within each time step. As a consequence, simulated state variables may slightly differ between the two models. Nevertheless, the differences in modeled state variable due to different update methods of the two models were small enough not to generate notable differences as shown in global simulation of the carbon and nitrogen storage.

System Tests
To ensure no technique errors due to introduction of the matrix modules, we conducted system tests to verify four requirements: (1) the model should be able to run to completion, (2) results from direct run and restart run should be completely identical, (3) results should be independent on number of processors and threading, and (4)  The policy of the CLM code development required any new modules to pass a suite of predefined tests before they can be merged into the trunk version of CLM. There were 148 tests in the test suites. For example, the test suites included basic smoke test (SMS) (i.e., a single run of the model) and exact restart test (ERS) to ensure that restart runs give bit-for-bit identical results to a continuous run. All 148 tests in the test suites passed with the new matrix modules. Details of the CLM system tests can be found on this site (https://github.com/ESCOMP/CTSM/wiki/System-Testing-Guide).

Computational Efficiency
The matrix version of CLM5 was slightly more computationally expensive due to the matrix operation and the calculation of additional diagnostic variables. We conducted two 1-month global 1.875°× 2.25°resolution simulations, with matrix on and off. Our test showed that the matrix version of CLM generates 26.1% additional computational cost using the sparse matrix algorithm. Without the sparse matrix algorithm, the computation cost of using the matrix modules was~300% higher than that with the original modules. However, the matrix method has the advantage that it can accelerate spin-up by~1 order of magnitude in comparison with the existing accelerated decomposition (AD) method. This spin-up approach has been successfully implemented into CABLE (Xia et al., 2012). The implementation into CLM5 will be described in another paper.
Approximately four-fifths of the additional computational cost incurred by the matrix modules was from matrix operation (i.e., 20.0% out of the 26.1% additional cost) (Figure 4). The matrix operation for the soil matrix module was more than 3 times that for the vegetation matrix module, due to the size of the soil matrix (140 × 140). The other 6.1% cost was from the matrix recording (i.e., 4.3%), and writing additional matrix diagnostic variables to the output files in the output module (1.2%). Since the test ran CLM5 for only one month, most of the writing cost was from writing the restart file. The relative contribution of rewriting cost may decrease as the model runs longer.

Diagnostics of Global Biogeochemical Dynamics With Traceability Analysis
One major advantage of the matrix approach is to offer a more effective diagnostic capacity than the original biogeochemical models. To demonstrate the diagnostic capability of the matrix model, we compared soil carbon storage from two steady state simulations driven by GSWP3 and CRUNCEPv7 forcing data sets. Our results found a significantly higher soil carbon storage when the model is forced with CRUNCEPv7 data, especially in permafrost region at high latitudes ( Figure 5). The high carbon storage in such regions can be over 250,000 g C m −2 . The rich carbon storage area simulated under CRUNCEPv7 is substantially larger than that under GSWP3. The differences in soil carbon storage can be over 50,000 g C m −2 .
To understand the causes of the large differences in soil carbon storage under the two sets of forcings, we conducted a traceability analysis. Our results showed the global soil carbon storage under CRUNCEPv7 (5,366 Gt C) more than doubled that under GSWP3 (2,462 Gt C). This difference in soil carbon storage was primarily attributed to a much longer soil carbon residence time with CRUNCEPv7 (74.4 years) than GSWP3 (38.0 years). The global soil carbon inputs played a minor role in different soil carbon storages between the two forcing data sets. The carbon input was 72.0 Gt C year −1 under CRUNCEPv7, which was just 11% higher than the 64.6 Gt C year −1 with GSWP3. The carbon residence time can be further decomposed into environmental scalars and baseline residence time. Overall, differences in carbon input, baseline

10.1029/2020MS002105
Journal of Advances in Modeling Earth Systems residence time, temperature scalar, water scalar, oxygen scalar, nitrogen scalar, and other scalars contributed 10.0%, 0.3%, 23.4%, 57.2%, 0.6%, 0.2%, and 8.3%, respectively, to the differences in soil carbon storage between the two forcing sets (Table 3). Among all the traceable components, the water scalar consistently made the most significant contribution to the different soil carbon storage under the two sets of forcings for most of the plant functional types.

Modularity in Coding
Without changing modeled processes, the matrix model reorganizes CLM5 biogeochemical modules according to elements in Equations 1 and 2 for vegetation carbon and nitrogen dynamics and Equations 12 and 13 for soil carbon and nitrogen dynamics. The original modules of CLM5 represent carbon and nitrogen cycles in six subroutines for updating state variables related to phenology, mortality, fire, and harvest, respectively. The reorganized CLM5 matrix model represent carbon and nitrogen cycles in two subroutines for vegetation and soil. Influences of phenology, mortality, and fire on carbon and nitrogen dynamics are expressed by the process rates and transfer matrices (i.e., phenology (K ph , A ph ), gap mortality (K gm , A gm ), fire (K fi , A fi ), soil CN horizontal transfer (K h , A h ), and vertical mixing (V ). The gap mortality matrices (K gm , A gm ) include influences of harvest. The organization of all the carbon and nitrogen processes in the matrix form makes multiple processes more traceable and enables better diagnostics analysis as described in section 4.2. Moreover, the modularity offered by the matrix model benefits teaching and learning about the model processes. One difficulty in learning carbon and nitrogen cycle models by students or postdoctoral scientists is the complicated carbon and nitrogen cycle networks. CLM5 uses hundreds of carbon and nitrogen balance equations to describe carbon and nitrogen cycles. Neither processes nor parameters are easily discerned from mass balance equations. A modularized representation in the matrix model summarizes all the mass balance equations into one matrix equation for either carbon or nitrogen cycle. In particular, the nonzero entries in those transfer coefficient matrices succinctly indicates connections within the networked pools. Note. The number in parentheses is the ratio of one traceable component between the two forcing sets (CRUNCEP-v7 over GSWP). NET = needleleaf evergreen tree, NDT = needleleaf deciduous tree, BET = broadleaf evergreen tree, BDT = broadleaf deciduous tree, BES = broadleaf evergreen shrub, BDS = broadleaf deciduous shrub.

Diagnostic Capability
One of the most useful advantages that the matrix approach offers is the diagnostic capability, such as traceability analysis and semianalytic attribution analysis. Xia et al. (2013) introduced the traceability framework and applied it to analyze sources of differences in modeled carbon cycle at steady state by CABLE. Modeled carbon storage in CABLE, CLM-CASA, and CLM4 were decomposed into the traceable components, net primary productivity (NPP) and residence time (Rafique et al., 2017). The comparison among the traceable components and corresponding observations provided fundamental explanations for the modeled carbon storage deviations at steady state. The traceability analysis was extended to understand transient dynamics of land carbon cycle in response to climate change according to the mathematic properties of the matrix equation . The transient traceability framework was applied to analyze differences in carbon dynamics between Harvard and Duke Forests . The study found that the divergences in allocation coefficients and environmental scalars drove the residence time toward two opposite directions, although modeled carbon storage was only slightly different between Harvard and Duke Forests. At the global scale, uncertainties in modeled historical carbon storage among models in three Model Intercomparison Project (MIPs), which were CMIP5, MsTMIP, and TRENDY, were traced to NPP, residence time, and carbon storage potential .
We implemented the traceability analysis in the CLM5 matrix model and illustrated its utility to understand sources of differences in model simulations with respect to large differences in soil carbon storage in CLM5 when driven by CRUNCEPv7 versus GSWP3, as reported in . On the face of it, the large difference in soil carbon storage arising from two plausible historical climate reconstructions was puzzling. The global mean annual temperature and precipitation only differed by 0.4°C and 20 mm, respectively, between these data sets. Our traceability analysis enabled decomposition of the controls on soil carbon storage into seven components, including residence time and carbon inputs, with the residence time impact further decomposed into environmental scalars and baseline residence time.
Our traceability analysis indicates that the differences in modeled soil carbon storage between CRUNCEPv7 and GSWP3 were primarily due to the water scalars either based on global average or PFT average. Especially, the water scalar contributions from PFTs in permafrost regions are significantly higher (Table 3). The importance of water scalar in permafrost regions is consistent with experimental results that permafrost soil carbon dynamics were very sensitive to changes in soil thermal and hydrological conditions (Mauritz et al., 2017). CLM5 matrix model can further explain the high sensitivity mathematically. In matrix Equations 25 and 27, the water scalar in the carbon storage capacity is calculated as an inverse of the water scalar (1/ξ W ). Phase changes in soil water from frozen to active soil or the reverse can cause dramatic changes in soil water scalar. For example, water scalar is usually nearly 0 when soil water is at phase change.
The inverse of a nearly 0 number can be extremely large and cause huge changes in simulated soil carbon even with a small change in precipitation or air temperature. Our results not only confirmed the conclusion on the large differences in permafrost regions by  but also accurately identified the causes of the differences.
The diagnostic capability that the matrix model offers not only help identify sources in modeled carbon cycle dynamics caused by different sets of meteorological forcing but also help understand model structures. For example, Du et al. (2018) applied the traceability analysis to evaluate the large differences in carbon storage capacity among three existing schemes in carbon-nitrogen coupling. With assistance of the traceability analysis, the differences were mainly attributed to downregulation of photosynthesis, plant tissue C:N ratio, and plant nitrogen uptake.
Besides the traceability analysis, the matrix approach enables the semianalytic attribution to evaluate relative importance of different processes in modeled biogeochemical dynamics. For example, Huang, Lu, et al. (2018) used semianalytic attribution to evaluate CO 2 fertilization effects from different processes.
The study evaluated relative importance of several processes, such as changes in carbon input, allocation, CO 2 -induced nitrogen limitation, environmental scalars, and vertical mixing processes, for CO 2 effects on soil carbon storage. The semianalytic attribution identified carbon input as the most important contributor.

Matrix Representation of Terrestrial Carbon Cycle Models
the commonly used models, such as CLM5, and nonlinear microbial models (Sierra & Muller, 2015). The matrix equation unifies carbon cycle models by varying its dimensions and details of expression of its terms related to carbon input, plant allocation, process rates, carbon transfer, and environmental modifiers. For examples, CABLE has only nine carbon pools including three vegetation pools, three litter pools, and three soil pools, whereas CLM5 has 18 vegetation carbon pools and 140 soil carbon pools. Thus, the matrix equation has nine dimensions for CABLE and 158 for CLM5 if one grid is occupied by one vegetation type and 194 if one grid is occupied by three vegetation types. Then, the number of the carbon transfers increased exponentially with the number of pools. CABLE includes three vegetation carbon transfers whereas CLM5 has a total of 56 carbon transfers in vegetations. Moreover, CLM5 has much more detailed process representation related to fire, vertical carbon transfers in soil, and nitrogen dynamics than CABLE. Nevertheless, all those detailed processes in CLM5 are all folded into the five terms of matrix equation as in Equations 1, 2, 12, and 13. Thus, CABLE and CLM5 share similar mathematical expression and properties.
The matrix equations can be semianalytically solved to estimate initial pool sizes at steady state. This is semianalytic spin-up (SASU). In comparison, a traditional method that runs a model to the steady state usually takes thousands of years with cycled meteorological forcing. An accelerated decomposition (i.e., AD) method requires hundreds of cycles (Koven et al., 2013;Thornton & Rosenbloom, 2005). Thus, SASU requires tends of cycles to reach steady states and, thus, can reduce the computational cost by 1 or 2 orders of magnitude for spinning up global land models (Xia et al., 2012). The elevated computational efficiency by SASU enables parameter sensitivity analysis (Huang, Zhu, et al., 2018) and pool-based data assimilation (Hararuk et al., 2015;Hararuk & Luo, 2014;Shi et al., 2018). Sensitivity of the carbon storage from 34 parameters have been evaluated (Huang, Zhu, et al., 2018). The active layer depth among 34 parameters has been identified to play important role in predicting high-latitude soil organic carbon. No parameter sensitivity analysis on modeling soil organic carbon can be conducted previously with complex biogeochemical models. The efficient matrix-based sensitivity analysis can help effectively evaluate such models to improve our understanding.

Conclusion
We fully implemented the matrix approach to CLM5 biogeochemistry modules by reorganizing hundreds of mass balance equations of carbon and nitrogen cycles into four matrix equations. Dynamics in vegetation carbon, vegetation nitrogen, soil carbon, and soil nitrogen are separately represented. The CLM5 matrix model fully reproduced the structure and simulations of the original biogeochemistry modules of CLM5. Moreover, the matrix model provided additional benefits to modeling, such as modularity, effective diagnostics, and high computational efficiency. The simplicity and modularity in code benefit teaching and learning.
The matrix approach provides strong diagnostic capacity, such as traceability analysis, semianalytic attribution, and model structure comparison. Simulated carbon and nitrogen storage dynamics become more transparent and traceable than the original model.
The successful implementation of the matrix approach to CLM5 is a demonstration that most, if not all, the biogeochemical models can be represented in the matrix form. To date, about 30 models have been represented with the matrix form. Among them, CLM5 is among the most complicated models. Besides the benefit mentioned above, the standardized model structure will make biogeochemistry models more comparable. That is especially useful for model intercomparison projects.

Data Availability Statement
The code of the matrix model of CLM5, the model results, and the script for traceability analysis are available at this site (http://www2.nau.edu/luo-lab/download/Lu_2020_JAMES.php).