Bayesian reconstruction of past land-cover from pollen data: model robustness and sensitivity to auxiliary variables

Realistic depictions of past land cover are needed to investigate prehistoric environmental changes, effects of anthropogenic deforestation, and long term land cover-climate feedbacks. Observation based reconstructions of past land cover are rare and commonly used model based reconstructions exhibit considerable differences. Recently \citet[Spatial Statistics, 24:14--31,][]{PirzaLPG2018_24} developed a statistical interpolation method that produces spatially complete reconstructions of past land cover from pollen assemblage. These reconstructions incorporate a number of auxiliary datasets raising questions regarding the method's sensitivity to different auxiliary datasets. Here the sensitivity of the method is examined by performing spatial reconstructions for northern Europe during three time periods (1900 CE, 1725 CE and 4000 BCE). The auxiliary datasets considered include the most commonly utilized sources of past land-cover data --- e.g.\ estimates produced by a dynamic vegetation (DVM) and anthropogenic land-cover change (ALCC) models. Five different auxiliary datasets were considered, including different climate data driving the DVM and different ALCC models. The resulting reconstructions were also evaluated using cross-validation for all the time periods. For the recent time period, 1900 CE, the different land-cover reconstructions were compared against a present day forest map. The validation confirms that the statistical model provides a robust spatial interpolation tool with low sensitivity to differences in auxiliary data and high capacity to capture information in the pollen based proxy data. Further auxiliary data with high spatial detail improves model performance for areas with complex topography or few observations.


Introduction
The importance of terrestrial land cover for the global carbon cycle and its impact on the climate system is well recognized (e.g. Claussen et al., 2001;Brovkin et al., 2006;Arneth et al., 2010;Christidis et al., 2013). Many studies have found large climatic effects associated with changes in land cover.
Forecast simulations evaluating the effects of human induced global warming predict a considerable amplification of future climate change, especially for Arctic areas (Zhang et al., 2013;Richter-Menge et al., 2011;Chapman and Walsh, 2007;Miller and Smith, 2012). The past anthropogenic deforestation of the temperate zone in Europe was lately demonstrated to have an impact on regional climate similar in amplitude to present day climate change (Strandberg et al., 2014).
However, studies on the effects of vegetation and land-use changes on past climate and carbon cycle often report considerable differences and uncertainties in their model predictions (de Noblet-Ducoudré et al., 2012;Olofsson, 2013).
One of the reasons for such widely diverging results could be the differences in past land-cover descriptions used by climate modellers. Possible land-cover descriptions range from static present-day land cover (Strandberg et al., 2011), over simulated potential natural land cover from dynamic (or static) vegetation models (DVMs) (e.g. Brovkin et al., 2002;Hickler et al., 2012), to past land-cover scenarios combining DVM derived potential vegetation with estimates of anthropogenic land-cover change (ALCC) (Strandberg et al., 2014;Pongratz et al., 2008;de Noblet-Ducoudré et al., 2012).
Differences in input climates, mechanistic and parametrisation differences of DVMs (Prentice et al., 2007;Scheiter et al., 2013), and significant variation between existing ALCC scenarios (e.g. Kaplan et al., 2009;Pongratz et al., 2008;Goldewijk et al., 2011;Gaillard et al., 2010) further contribute to the differences in past land-cover descriptions. These differences can lead to largely diverging estimates of past land-cover dynamics even when the most advanced models are used (Strandberg et al., 2014;Pitman et al., 2009). Thus, reliable land-cover representations are important when studying biogeophysical impacts of anthropogenic land-cover change on climate.
The palaeoecological proxy based land-cover reconstructions recently published by Pirzamanbein et al. (2014Pirzamanbein et al. ( , 2018 were designed to overcome the problems described above. And to provide a proxy based land-cover description applicable for a range of studies on past vegetation and its interactions with climate, soil and humans. These reconstructions use the pollen based land-cover composition (PbLCC) published by Trondman et al. (2015) as a source of information on past land-cover composition.
The PbLCC are point estimates, depicting the land-cover composition of the area surrounding each of the studied sites. Spatial interpolation is needed to fill the gaps between observations and to produce continuous land-cover reconstructions. Conventional interpolation methods might struggle when handling noisy, spatially heterogeneous data (Heuvelink et al., 1989;de Knegt et al., 2010), but statistical methods for spatially structured data exist (Gelfand et al., 2010;Blangiardo and Cameletti, 2015).
In Pirzamanbein et al. (2018) a statistical model based on Gaussian Markov Random Fields (Lindgren et al., 2011;Rue and Held, 2005) was developed to provide a reliable, computationally effective and freeware based spatial interpolation technique. The resulting statistical model combines PbLCC data with auxiliary datasets; e.g. DVM output, ALCC scenarios, and elevation; to produce reconstructions of past land cover. The auxiliary data is subject to the differences and uncertainties outlined above and the choice of auxiliary data could influence the accuracy of the statistical model. The major objectives of this paper are: 1) To draw attention of climate modelling community to a novel set of spatially explicit pollen-proxy based land-cover reconstructions suitable for climate modelling; 2) to present and test the robustness of the spatial interpolation model developed by Pirzamanbein et al. (2018);and 3) to evaluate the models capacity to recover information provided by PbLCC proxy data and to analyse its sensitivity to different auxiliary datasets.

Material and Methods
The studied area covers temperate, boreal and alpine-arctic biomes of central and northern Europe Four different model derived datasets, depicting past land cover, along with elevation were considered as potential auxiliary datasets. In each case potential natural vegetation composition estimated by the DVM LPJ-GUESS (Lund-Potsdam-Jena General Ecosystem Simulator; Smith et al., 2001;Sitch et al., 2003) were combined with an ALCC scenario to adjust for human land use (see Pirzamanbein et al., 2014, for details): K-L RCA3 : Combines the ALCC scenario KK10 (Kaplan et al., 2009) and the potential natural vegetation from LPJ-GUESS. Climate forcing for the DVM was derived from RCA3 (Rossby Centre Regional Climate Model, Samuelsson et al., 2011)  is based on a combination of satellite data and national forest-inventory statistics from 1990-2005 (Pivinen et al., 2001;Schuck et al., 2002) (Figure 1, column 1 row 1). All auxiliary data were up-scaled to 1 • × 1 • spatial resolution, matching the pollen based reconstructions, before usage as model input.

Statistical Model for Land-cover Compositions
A Bayesian hierarchical model is used to interpolate the PbLCC data; here we only provide a brief overview of the model, mathematical and technical details can be found in Pirzamanbein et al. (2018).
The model can be seen as a special case of a generalized linear mixed model with a spatially correlated random effect. An alternative interpretation of the model is as an empirical forward model (direction of arrows in Figure 2) where parameters affect the latent variables which in turn affect the data.
Reconstructions are obtained by inverting the model (i.e. computing the posterior) to obtain the latent variables given the data.
The PbLLC derived proportions of land cover (coniferous forest, broadleaved forest and unforested land), denoted Y PbLCC , are seen as draws from a Dirichlet distribution (Kotz et al., 2000, Ch. 49) given a vector of proportions, Z, and a concentration parameter, α (controlling the uncertainty: V(Y PbLCC ) ∝ 1/α). Since the proportions have to obey certain restrictions (0 ≤ Z k ≤ 1 and 3 k=1 Z k = 1, were k indexes the land-cover types), a link function is used to transform between the Figure 1: Data used in the modelling. The first column shows (from top to bottom) the EFI-FM, SRTM elev , and the colorkey for the land-cover compositions, coniferous forest (CF), broadleaved forest (BF) and unforested land (UF). The remaining columns give (from left to right) the PbLCC (Trondman et al., 2015) and the four model based compositions considered as potential covariates: K-L RCA3 , K-L ESM , H-L RCA3 , and H-L ESM . Here K/H indicates KK10 (Kaplan et al., 2009) or HYDE (Goldewijk et al., 2011) land use scenarios and L RCA3 /L ESM indicates the climate -Rossby Centre Regional Climate Model (Samuelsson et al., 2011) or Earth System Model (Mikolajewicz et al., 2007) -used to drive the vegetation model. The three rows represent (from top to bottom) the time periods 1900 CE, 1725 CE, and 4000 BCE. Figure 2: Hierarchical graph describing the conditional dependencies between observations (white rectangle) and parameters (grey rounded rectangles) to be estimated. The white rounded rectangles are computed based on the estimations. In a generalized linear mixed model framework, η is the linear predictor -consisting of a regression term, µ, and a spatial random effect, X. The link function, f (η), transforms between linear predictor and proportions, which are matched to the observed land cover proportions, Y PbLCC , using a Dirichlet distribution.  (Becker et al., 2009), K/H indicates KK10 (Kaplan et al., 2009) or HYDE (Goldewijk et al., 2011) land use scenarios and L RCA3 /L ESM indicates vegetation model driven by climate from the Rossby Centre Regional Climate Model (Samuelsson et al., 2011) or from an Earth System Model (Mikolajewicz et al., 2007).
proportions and the linear predictor, η: Here f −1 (Z) is the additive log-ratio transformation (Aitchison, 1986), a multivariate extension of the logit transformation.
The linear predictor consists of a mean structure and a spatially dependent random effect, η = µ+X.
The mean structure is modelled as a linear regression, µ = Bβ; i.e. a combination of covariates, B, and regression coefficients, β. To aid in variable selection and suppress uninformative covariates a horseshoe prior (Park and Casella, 2008;Makalic and Schmidt, 2016) is used for β. The main focus of this paper is to evaluate the model sensitivity to the choice of covariates (i.e. the auxiliary datasets).
The PbLCC is modelled based on six different sets of covariates: 1) Intercept, 2) SRTM elev , 3) K-L ESM , 4) K-L RCA3 , 5) H-L ESM , and 6) H-L RCA3 ; illustrated in Figure 1. A summary of the different models is given in Table 1.
Finally, the spatially dependent random effect is modelled using a Gaussian Markov Random Field (Lindgren et al., 2011) with two parameters: κ, controlling the strength of the spatial dependence and

Testing the Model Performance
To evaluate model performance, we compared the land-cover reconstructions from different models for the 1900 CE time period with the EFI-FM by computing the average compositional distances (Aitchison et al., 2000;Pirzamanbein et al., 2018). This measure is similar to root mean square error in R 2 but it accounts for compositional properties (i.e. 0 ≤ Z k ≤ 1 and 3 k=1 Z k = 1).
Since no independent observational data exists for the 1725 CE and 4000 BCE time periods, we applied a 6-fold cross-validation scheme (Hastie et al., 2001, Ch. 7.10) to all models and time periods.

Results and Discussion
Fossil pollen is a well-recognized information source of vegetation dynamics and generally accepted as the best observational data on past land-cover composition and environmental conditions (Trondman et al., 2015).
Today, central and northern Europe have, at the subcontinental spatial scale, the highest density of palynologically investigated sites on Earth. However, even there the existing pollen records are irregularly placed, leaving some areas with scarce data coverage . The collection of new pollen data to fill these gaps is very time consuming and cannot be performed everywhere. All this makes pollen data, in their original format, heavily underused, since the data is unsuitable for models requiring continuous land-cover representations as input. The lack of spatially explicit proxy based land cover data directly usable in climate models has been hampering the correct representation of past climate-land cover relationship.
Regrettably, the commonly used DVM derived representations of past land cover exhibit large variation in vegetation composition estimates. The model derived land-cover datasets used as auxiliary data (Table 1) show large variation in estimated extents of coniferous and broadleaved forests, and unforested areas for all of the studied time periods (Figure 1). These substantial differences illustrate large deviances between model based estimates of the past land-cover composition due to differences in applied climate forcing and/or ALCC scenarios. Differences in climate model outputs (Harrison et al., 2014;Gladstone et al., 2005) and ALCC model estimates (Gaillard et al., 2010) have been recognized in earlier comparison studies and syntheses. The effect of the differences in input climate forcing and ALCC scenario on DVM estimated land-cover composition presented here are especially pronounced for central and western Europe, and for elevated areas in northern Scandinavia and the Alps ( Figure   1). In general the KK10 ALCC scenario produces larger unforested areas, notably in western Europe, compared to the HYDE scenario. Compared to the ESM climate forcing; the RCA3 forcing results in higher proportions of coniferous forest, especially for central, northern and eastern Europe. The described differences are clearly recognizable for all the considered time periods and are generally larger between time periods than within each time period. The purpose of the statistical model presented in Section 2.1 is to combine the observed PbLCC with the spatial structure in the auxiliary data to produce data driven spatially complete maps of past land-cover that can be used directly (as input) in others models.
To illustrate the structure of the statistical model, step by step advancement from auxiliary data    Table 1) with different auxiliary datasets. Locations and compositional values of the available PbLCC data are given by the black rectangles. Middle row shows the compositional distances between each model and the Constant model. Bottom row shows the compositional distances between each model and the EFI-FM.  Table 1) with different auxiliary datasets. Locations and compositional values of the available PbLCC data are given by the black rectangles. Third and fourth row show the compositional distances between each model and the Constant model.
for the areas with low observational data coverage (e.g. eastern and south-eastern Europe) is improved by including covariates that exhibit distinct spatial structures for the given areas (Figures 5 and 6).
Neither the DIC results nor the 6-fold cross-validation results show any advantage among the six tested models for the different time periods (Table 2). Analogous to the reconstructions, the predictive regions are very similar in both size and shape irrespective of the auxiliary dataset used, indicating similar reconstruction uncertainties across all models (Figure 7). Implying there is no clear preference among the models, i.e. that the results are robust to the choice of auxiliary dataset.  Table 2: Deviance information criteria (DIC) and Average compositional distances (ACD) from 6-fold cross-validations for each of the six models and three time periods. Best value for each time period marked in bold-font.

DIC
Although a temporal misalignment exists between the PbLCC data for the 1900 CE time period (based on pollen data from 1850 to the present) and the EFI-FM (inventory and satellite data from 1990-2005); EFI-FM provides the best complete and consistent land cover map of Europe for present time, making it a reasonable choice for a comparison. The main differences between the EFI-FM and the PbLCC data for the 1900 CE time period are: 1) lower abundance of broadleaved forests for most of Europe, 2) higher abundance of coniferous forest in Sweden and Finland, and 3) higher abundance of unforested land in North Norway in the EFI-FM data than in the PbLCC data (Pirzamanbein et al., 2018). The average compositional distances computed between the land-cover reconstructions and the EFI-FM for 1900 CE show practically identical (1.47 to 1.48) distances between all six reconstructions and the EFI-FM, and small differences among the six presented models (Table 3).
These results clearly show that the developed statistical interpolation model is robust to the choice of covariates. The model is suitable for reconstructing spatially continuous maps of past land cover from scattered and irregularly spaced pollen based proxy data.

Conclusions
The statistical model and Bayesian interpolation method presented here has been specially designed for handling irregularly spaced palaeo-proxy records like pollen data and, dependent on proxy data availability, is globally applicable. The model produces land-cover maps by combining irregularly distributed pollen based estimates of land cover with auxiliary data and a statistical model for spatial structure. The resulting maps capture important features in the pollen proxy data and are reasonably insensitive to the use of different auxiliary datasets.
Auxiliary datasets considered were complied from commonly utilized sources of past land-cover data (outputs from a dynamic vegetation model using different climatic drivers and anthropogenic land-cover changes scenarios). These datasets exhibit considerable differences in their recreation of the past land cover. Emphasizing the need for the independent, proxy based past land-cover maps created in this paper.
Evaluation of the model's sensitivity indicates that the proposed statistical model is robust to the choice of auxiliary data and only considers features in the auxiliary data that are consistent with the proxy data. However, auxiliary data with detailed spatial information considerably improves the interpolation results for areas with low proxy data coverage, with no reduction in overall performance.
This modelling approach has demonstrated a clear capacity to produce empirically based land-cover reconstructions for climate modelling purposes. Such reconstructions are necessary to evaluate anthropogenic land-cover change scenarios currently used in climate modelling and to study past interactions between land cover and climate with greater reliability. The model will also be very useful for producing reconstructions of past land cover from the global pollen proxy data currently being produced by the PAGES (Past Global changES) LandCover6k initiative 1 .

Data availability
The database containing the reconstructions of coniferous forest, broadleaved forest and unforested land, three fractions of land cover, for the three time-periods presented in this paper, along with reconstructions for 1425 CE and 1000 BCE using only the K-L ESM are available for download from https://github.com/BehnazP/SpatioCompo. In addition the source code is available in the same repository under the open source GNU General Public License.
Acronyms DVM Dynamical vegetation model.

PbLCC
Pollen based land-cover composition.

EFI-FM
European Forest Institute forest map.
f Link function, transforming between proportions and linear predictor.
X Spatially dependent random effect.
α Concentrated parameter of the Dirichlet distribution (i.e. observational uncertainty) Σ Covariance matrix that determines the variation between and within fields κ Scale parameter controlling the range of spatial dependency