Global Core Top Calibration of δ18O in Planktic Foraminifera to Sea Surface Temperature
Abstract
The oxygen isotopic composition of planktic foraminiferal calcite (δ18Oc) is one of the most prevalent proxies used in the paleoceanographic community. The relationship between δ18Oc, temperature, and seawater oxygen isotopic composition (δ18Osw) is firmly rooted in thermodynamics, and experimental constraints are commonly used for sea surface temperature (SST) reconstructions. However, in marine sedimentary applications, additional sources of uncertainty emerge, and these uncertainties have not yet been included in global calibration models. Here, we compile a global data set of over 2,600 marine sediment core top samples for five planktic species: Globigerinoides ruber, Trilobatus sacculifer, Globigerina bulloides, Neogloboquadrina incompta, and Neogloboquadrina pachyderma. We develop a suite of Bayesian regression models to calibrate the relationship between δ18Oc and SST. Spanning SSTs from 0.0 to 29.5 °C, our annual model with species pooled together has a mean standard error of approximately 0.54‰. Accounting for seasonality and species-specific differences improves model validation, reducing the mean standard error to 0.47‰. Example applications spanning the Late Quaternary show good agreement with independent alkenone-based estimates. Our pooled calibration model may also be used for reconstruction in the deeper geological past, using modern planktic foraminifera as an analog for non-extant species. Our core top-based models provide a robust assessment of uncertainty in the δ18Oc paleothermometer that can be used in statistical assessments of interproxy and model-proxy comparisons. The suite of models is publicly available as the Open Source software library bayfox, for Python, R, and MATLAB/Octave.
Key Points
- We develop Bayesian calibration models for planktic δ18Oc
- Accounting for seasonal abundance and species-specific sensitivities improves inference
- Our models produce realistic SST reconstructions for both recent and “deep-time” δ18Oc data
1 Introduction











Synthetic calcite studies indicate that under equilibrium conditions, δ18Oc sensitivity to temperature is approximately −0.19‰ · °C−1 near 30 °C and approximately −0.25‰ · °C−1 near 0 °C (Kim & O'Neil, 1997). Calibration studies using foraminifera from lab cultures and marine tows yield comparable temperature sensitivities (−0.20 to −0.28‰ · °C−1; e.g., Bemis et al., 1998; Bouvier-Soumagnac & Duplessy, 1985; Mulitza et al., 2003a; Shackleton, 1974), although in some cases the foraminiferal calibrations are offset (0.2–0.8‰) from inorganic calibrations (e.g., Caron et al., 1990; Duplessy & Blanc, 1981; Waelbroeck et al., 2005). The error in calibrations from culture or plankton tows is greater than the error of inorganic calibrations; for example, lab culture calibrations have standard errors of 0.15‰ (e.g., Bemis et al., 1998) and plankton tow calibrations of 0.26‰ (e.g., Mulitza et al., 2003a).
While there is general agreement of δ18Oc sensitivity between synthetic calcite experiments and foraminiferal-based investigations, application of δ18Oc paleothermometry to marine sediment records poses additional challenges and unavoidable sources of uncertainty. For one, δ18Osw is not precisely known for past oceanographic conditions and must be estimated, potentially introducing a large source of uncertainty under both present and past oceanic conditions. To a leading order, δ18Osw reflects the precipitation-evaporation balance over the open ocean, but it is also modified by local and regional processes such as ice formation, glacial meltwater, seasonal freshwater runoff, water mass advection, and mixing (Craig & Gordon, 1965). The climatological processes influencing δ18Osw, coupled with the scarcity of measurements in many regions of the modern ocean, can lead to large uncertainties in δ18Osw for certain locations such as the high latitudes (LeGrande & Schmidt, 2006). Moving back through time, δ18Osw distributions must either be estimated through a priori assumptions about oceanographic setting or predicted by isotope-enabled climate models. Alternatively, research questions about δ18Osw can be addressed by reconstructing temperature with another independent proxy and then isolating δ18Osw from δ18Oc.
In foraminiferal calcite, the uncertainty of shell δ18Oc temperature calibrations is influenced by biological processes, such as photosynthesis in algal symbionts (e.g., Duplessy et al., 1970; Ravelo & Fairbanks, 1992; Spero & Lea, 1993; Spero et al., 1997) and biases in the formation of gametogenic and ontogenetic calcite (e.g., Hamilton et al., 2008; Spero & Lea, 1996; Williams et al., 1979). In addition, each species exhibits a distinct seasonality and depth habitat in the water column (e.g., Fairbanks & Wiebe, 1980; Fairbanks et al., 1982; Kohfeld et al., 1996; Sautter & Thunell, 1989, 1991; Žarić et al., 2005), and even within the morphospecies commonly used for classification of fossil foraminifera, there may be additional differences in life cycle and habitat preferences due to genotypic diversity (Aurahs et al., 2011; Darling & Wade, 2008; Kucera & Darling, 2002). While such ecological relationships can be leveraged for season- and depth-specific climate reconstructions (e.g., Mulitza et al., 2003b; Spero et al., 2003; Williams et al., 1981), these relationships can change through time in response to environmental or biological variations, complicating paleoenvironmental interpretations (e.g., Mulitza et al., 1998). Changes in pH/carbonate ion concentration [CO₃²⁻] during calcification also influence δ18Oc (Bijma et al., 1999; Spero et al., 1997; Zeebe, 1999). Finally, the sedimentary environment influences the fidelity of δ18Oc as it is preserved downcore. Shells deposited in bottom waters undersaturated in [CO₃²⁻] may partly dissolve and recrystallize, a process that alters the original isotopic signature via exchange with pore water δ18O (e.g., Schrag et al., 1995). Bioturbation can be an especially strong source of core top variability in areas with low sedimentation rates, where glacial age sediments may become mixed with relatively modern sediments in a core top sample (Waelbroeck et al., 2005).
Many paleoceanographic applications use laboratory calibrations to transform δ18Oc data to sea surface temperatures (SST), but these calibrations do not capture the range of biological, chemical, and sedimentological uncertainties enumerated above. Capturing these uncertainties is important for realistically estimating paleo-SSTs from the marine sediment archive and critical for multiproxy or climate model-proxy comparisons. In this study, we develop a calibration for the δ18Oc-SST relationship in planktic foraminifera using core top data and Bayesian regression. The Bayesian approach explicitly models the uncertainty in calibration model parameters and then propagates this uncertainty into inferred SSTs, facilitating probabilistic estimates of past climate (Tierney & Tingley, 2014, 2018). In developing Bayesian regression for δ18Oc, we pay particular attention to species-specific differences and seasonal abundance. Depth habitat is also an important factor, but in this work we focus specifically on near-surface (mixed layer) dwelling species that are frequently used to reconstruct SST, for which slight differences in depth habitat are less likely to appreciably affect regression models. Calibration of deep-dwelling species is left to future efforts.
In what follows, we develop mean annual and seasonal calibration models for five planktic species commonly used in paleoceanography: Globigerinoides ruber, Trilobatus sacculifer, Globigerina bulloides, Neogloboquadrina incompta, and Neogloboquadrina pachyderma. We also develop a model that pools these species together for application to extinct species of planktic foraminifera commonly used for Cenozoic paleoceanographic reconstructions (e.g., Zachos et al., 1994). These models are freely available to researchers as a software library called bayfox. We give examples of how our calibrations can be applied for paleotemperature reconstructions over the Late Quaternary as well as in deeper geologic time, comparing our calibration and its uncertainty with established Bayesian alkenone UK′37 and TEX86 reconstructions.
2 Methods and Data Selection
2.1 SST and Seawater δ18O

Our Bayesian models use modern SSTs and δ18Osw as predictors. For SSTs, we used both monthly and annual fields from the World Ocean Atlas 2013 version 2 (Boyer et al., 2013). For δ18Osw, we used the top layer of the estimated annual fields from LeGrande and Schmidt (2006). Both of these products have 1° × 1° spatial resolution. The LeGrande and Schmidt (2006) δ18Osw field is based on δ18Osw observations from the last half-century, and in areas with sparse isotope sampling coverage, it uses regional δ18Osw-salinity relationships to estimate isotope values. The δ18Osw field does not include uncertainty estimates for grid points, though LeGrande and Schmidt (2006) note that annual average values in regions near or under sea ice may be more uncertain due to a limited number of observations and large seasonal fluctuations in runoff and precipitation that induce high variance in δ18Osw.
2.2 Core Top Planktic Foraminiferal δ18O

We compiled planktic δ18Oc sediment core records for five foraminiferal species: G. ruber, T. sacculifer, G. bulloides, N. incompta, and N. pachyderma. G. ruber (white) and G. ruber (pink) have some regional differences in seasonality and relative abundance (e.g., Bé, 1960; Williams et al., 1981). However, we opted to evaluate G. ruber white and pink together because our preliminary G. ruber (pink) calibration was strongly influenced by sites off the northwest coast of Africa, leading to model parameters that we believe may reflect sampling and statistical artifacts more than differences in G. ruber (pink) calcification.
Core top and Late Holocene records were gathered from the Multiproxy Approach for the Reconstruction of the Glacial Ocean data set (Waelbroeck et al., 2005) which extends the collection of Schmidt and Mulitza (2002). We supplemented this collection with additional sources (Arbuszewski et al., 2010, 2013; Boussetta et al., 2012; Brown & Elderfield, 1996; Cléroux et al., 2008; Dahl & Oppo, 2006; Dekens et al., 2002; Dyez et al., 2014; Elderfield & Ganssen, 2000; Fallet et al., 2012; Farmer, 2005; Ganssen & Kroon, 2000; Garidel-Thoron et al., 2007; Gebregiorgis et al., 2016; Gibbons et al., 2014; Johnstone et al., 2011; Kozdon et al., 2009; Lea et al., 2006; Leduc et al., 2007; Linsley et al., 2010; Mashiotta et al., 1999; Mathien-Blard & Bassinot, 2009; Meland et al., 2006; Moffa-Sánchez et al., 2014; Mohtadi et al., 2010, 2011; Nürnberg et al., 2008; Oppo et al., 2009; Oppo & Sun, 2005; Pahnke et al., 2003; Palmer & Pearson, 2003; Parker et al., 2016; Regenberg et al., 2009; Richey et al., 2007; Riethdorf et al., 2013; Riveiros et al., 2016; Romahn et al., 2014; Rosenthal et al., 2003; Rustic et al., 2015; Sabbatini et al., 2011; Saraswat et al., 2013; Schmidt et al., 2012a, 2004, 2012b; Steinke et al., 2005, 2008; Steph et al., 2009; Stott et al., 2007; Sun et al., 2005; Thornalley et al., 2011; Tierney et al., 2016; Visser et al., 2003; Weldeab et al., 2005, 2006, 2007, 2014; Werner et al., 2013; Xu et al., 2010).
We excluded records from sites with annual SST ≤ 0 °C to reduce complications from local sea ice formation and poor δ18Osw estimates. Waelbroeck et al. (2005) show that age and sedimentation rate filtering can help reduce uncertainty stemming from the ambiguous “modern” age constraints. However, we consider these to be important sources of uncertainty to include in our calibration, so we did not filter core top sites by age or sedimentation rates. This also means that we are calibrating ambiguously modern core top samples against SST and δ18Osw fields influenced by anthropogenic climate change. This is an issue that affects all core top calibrations.
Our compilation consists of 2,636 observations (Figure 1a) with 1,002 for G. ruber, 635 for G. bulloides, 442 for T. sacculifer, 425 for N. pachyderma, and 132 for N. incompta. We then gridded the core top data to reduce the impact of spatial clustering by averaging samples for each species to the nearest 1° × 1° grid point of our SST and δ18Osw fields. After gridding, there were a total of 1,386 grid points, with 489 for G. ruber, 291 for G. bulloides, 273 for N. pachyderma, 243 for T. sacculifer, and 90 for N. incompta. References to core top data hereafter refer to the gridded core top data unless noted otherwise.
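The gridding step can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, the sample values are invented, and it assumes grid cell centers sit at half-degree offsets (for example 10.5°N, 60.5°W), which the text does not state.

```python
import math
from collections import defaultdict

def nearest_cell_center(lat, lon, res=1.0):
    """Center of the res-degree cell containing (lat, lon).

    Assumption: cell centers sit at half-degree offsets (0.5, 1.5, ...).
    """
    lat_c = (math.floor(lat / res) + 0.5) * res
    lon_c = (math.floor(lon / res) + 0.5) * res
    return lat_c, lon_c

def grid_core_tops(samples, res=1.0):
    """Average d18Oc per (species, grid cell).

    samples: iterable of (species, lat, lon, d18oc) tuples.
    Returns {(species, lat_c, lon_c): mean d18oc}.
    """
    buckets = defaultdict(list)
    for species, lat, lon, d18oc in samples:
        key = (species, *nearest_cell_center(lat, lon, res))
        buckets[key].append(d18oc)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

# Invented example values, for illustration only.
samples = [
    ("G. ruber", 10.2, -60.7, -1.8),
    ("G. ruber", 10.9, -60.1, -2.0),      # same cell as the sample above
    ("G. ruber", 12.4, -60.3, -1.5),      # different cell
    ("G. bulloides", 10.4, -60.5, -0.5),  # same cell, different species
]
gridded = grid_core_tops(samples)
```

Keying on species as well as cell keeps each species' average separate, matching the per-species grid point counts reported above.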

The core top data cover a wide range of modern SST values and reflect the general thermal preferences of each species (Table 1); for instance, G. ruber prefers relatively warmer waters, while G. bulloides is abundant across a wide range of temperatures, and N. incompta and N. pachyderma prefer cooler waters.
Table 1. Modern annual SST (°C)
Group | n | Min | Max | Mean | σ |
---|---|---|---|---|---|
Globigerinoides ruber | 489 | 10.9 | 29.6 | 24.9 | 3.8 |
Trilobatus sacculifer | 243 | 10.6 | 29.6 | 24.5 | 4.0 |
Globigerina bulloides | 291 | 1.8 | 29.6 | 13.6 | 6.9 |
Neogloboquadrina incompta | 90 | 2.6 | 19.6 | 11.5 | 4.7 |
Neogloboquadrina pachyderma | 273 | 0.1 | 21.4 | 6.1 | 4.3 |
Pooled | 1,386 | 0.1 | 29.6 | 17.8 | 9.0 |
Note. n values are sample size for gridded core top data.
2.3 Estimation of Foraminiferal Seasonal Abundance
The abundance of individual planktic foraminiferal species varies seasonally in response to changes in temperature and nutrients, which affect food availability (e.g., Williams et al., 1979, 1981). This motivates the development of a seasonally adjusted calibration model, wherein foraminiferal seasonal expression is modeled as a function of their preferred temperature range. To build such a model, we use sediment trap data compiled by Žarić et al. (2005) to identify temperature ranges that correspond to peak abundance for each foraminiferal species. The Žarić et al. (2005) data set pairs total foraminifera shell flux and local SSTs from 75 sites (Figure 1b). The data set contains a total of 5,548 observations with 1,807 for G. ruber, 1,034 for T. sacculifer, 1,255 for G. bulloides, 910 for N. incompta, and 542 for N. pachyderma.
To adjust for nonnormally distributed shell flux data, we applied a Box-Cox power transformation (Box & Cox, 1964) for comparison with SST (Figure 2). A kernel density estimate was fit to the observations to estimate the SST interval that corresponds with the highest 10% (most abundant) flux observations, which is taken to represent each species' “ideal” thermal niche, similar to the approach in Žarić et al. (2005).
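As a rough illustration of this step, the sketch below applies a Box-Cox transform and then takes the SST range spanned by the top 10% of flux observations. It deliberately swaps the kernel density estimate for a simple min/max over the highest-ranked observations, so it is a simplification rather than the procedure used here. The function names, the choice of λ, and the data are all hypothetical.

```python
import math

def box_cox(x, lam):
    """Box-Cox power transform of a positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

def peak_flux_sst_range(flux, sst, lam=0.0, top_fraction=0.10):
    """(min, max) SST spanned by the highest-flux observations.

    Simplification: uses the raw range of the top observations instead of
    a kernel density estimate.
    """
    transformed = [box_cox(f, lam) for f in flux]
    n_top = max(1, math.ceil(top_fraction * len(flux)))
    ranked = sorted(zip(transformed, sst), reverse=True)[:n_top]
    top_sst = [t for _, t in ranked]
    return min(top_sst), max(top_sst)

# Invented data: flux increases with SST across 20 observations.
flux = [float(i) for i in range(1, 21)]
sst = [float(i) for i in range(1, 21)]
niche = peak_flux_sst_range(flux, sst)
```

Because the Box-Cox transform is monotonic, it does not change which observations rank in the top 10%; its role is to make the flux distribution closer to normal for the density estimate, which this sketch omits.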

We then used these SST ranges to estimate the most likely seasons of peak shell flux for each core top foraminiferal observation, by averaging SSTs from all months at the location of the observation that fell within the SST range. If, for a given observation, no monthly temperatures fell within the peak abundance range, we took the average of the three monthly SSTs closest to the range. If fewer than three months fell within the range, we included the next closest months so as to guarantee a seasonal average of at least three months. The resulting seasonality estimates show that at core sites in the tropics, foraminiferal species abundances are typically annual or near annual, without a strong seasonal signal (Figure 3). For the warm-water species G. ruber and T. sacculifer, a stronger seasonal signal (typically summer-fall) appears outside of the tropics. Cold-water species are generally predicted to be annual within their expected extratropical ranges, although seasonal expressions do occur for locations with particularly broad annual temperature ranges (e.g., Figure S4 in the supporting information). Monthly maps of predicted niche SST ranges for each species are available in the supporting information (Figures S1–S5).
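The month-selection rule described above can be written compactly. The following is a minimal sketch with a hypothetical function name and invented monthly SSTs. It treats "closest to the range" as the absolute distance from a monthly SST to the nearest bound of the range, which is an assumption about a detail the text leaves open.

```python
def seasonal_sst(monthly_sst, sst_min, sst_max, min_months=3):
    """Average SST over months inside [sst_min, sst_max].

    If fewer than min_months fall inside the range, pad with the months
    whose SST is closest to the range until min_months are used.
    """
    def distance(t):
        # Distance from a monthly SST to the niche range (0 if inside).
        if t < sst_min:
            return sst_min - t
        if t > sst_max:
            return t - sst_max
        return 0.0

    in_range = [t for t in monthly_sst if distance(t) == 0.0]
    if len(in_range) >= min_months:
        chosen = in_range
    else:
        # In-range months have distance 0, so they sort first.
        chosen = sorted(monthly_sst, key=distance)[:min_months]
    return sum(chosen) / len(chosen)

# Invented monthly SSTs (°C) for one site, January to December.
monthly = [20.0, 21.0, 22.0, 24.0, 26.0, 28.0,
           29.0, 29.0, 27.0, 25.0, 22.0, 21.0]

five_months = seasonal_sst(monthly, 26.0, 30.0)  # five months in range
no_months = seasonal_sst(monthly, 30.0, 32.0)    # none in range
two_months = seasonal_sst(monthly, 28.5, 30.0)   # two in range, one padded
```

A site where every month falls inside the range reduces to the annual mean, which is consistent with the near-annual signal reported for tropical sites.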

We recognize that this method of identifying seasonal signals in foraminiferal abundance is a simplification that does not consider factors such as light and nutrient availability. We chose this approach because it is easily replicated and used with a large global sample data set. This approach can be adapted more broadly for forward modeling experiments where seasonality or monthly SSTs change relative to modern conditions.
2.4 Bayesian Calibration Models
We designed and fit four linear Bayesian regression models with Markov chain Monte Carlo (MCMC) sampling (for review see Gelman, 2014; Kruschke, 2015; McElreath, 2016). With a Bayesian approach, we can explicitly estimate the uncertainty in calibration model parameters and produce a full posterior predictive distribution of the predictand (δ18Oc or SST), rather than a single-point estimate. The four models differ in how they treat seasonal and species-specific adjustments to calibration. We compared their performance with cross-validation statistics, which we used to objectively assess whether considering species-specific variability and seasonality resulted in model improvement.





Our third and fourth models use seasonal SSTs in the place of annual SSTs, while retaining the same pooled and hierarchical model designs described above. The seasonal SSTs are based on seasonal peak abundance estimated from a network of marine sediment traps, as described in section 2.3. These seasonal models use annual δ18Osw estimates, as monthly fields are not available from LeGrande and Schmidt (2006).
All calibration models were cross validated using Pareto-Smoothed Importance Sampling leave-one-out cross validation (LOOCV; Vehtari et al., 2017). This statistic measures model predictive performance by approximating a LOOCV—estimating each core top sample as though it were left out of the calibration for validation purposes. A relatively low score is “better,” indicating improved performance.
Our four models are “forward models” (that is, they predict δ18Oc given SST and δ18Osw), but they can also be inverted, via Bayesian inference, to predict SST. To predict SST for a given δ18Oc and δ18Osw, parameters are drawn from the full conditional posteriors of the calibration models for the likelihood and then combined with a prior distribution of SSTs to yield a posterior. The SSTs in Table 1 are a suggested starting point to develop a prior distribution, though users can specify values to fit different environmental settings.
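To make the inversion concrete, the sketch below works through one simple case. It assumes a linear forward model of the form (δ18Oc − δ18Osw) = α + β·SST + ε with ε ~ N(0, τ²) and a normal prior on SST. That form, the function names, and the parameter values are assumptions chosen for illustration, not the fitted models or the bayfox API. Under these assumptions the SST posterior for any one parameter draw is normal and has a closed form, and sampling once per draw propagates the parameter uncertainty.

```python
import random

def sst_posterior(y, alpha, beta, tau, prior_mean, prior_sd):
    """Normal posterior (mean, sd) of SST for one parameter draw.

    y is the observed fractionation, d18Oc - d18Osw.
    Assumed likelihood: y ~ Normal(alpha + beta * SST, tau).
    """
    prior_prec = 1.0 / prior_sd ** 2
    like_prec = beta ** 2 / tau ** 2
    post_var = 1.0 / (prior_prec + like_prec)
    post_mean = post_var * (prior_mean * prior_prec
                            + beta * (y - alpha) / tau ** 2)
    return post_mean, post_var ** 0.5

def sst_posterior_draws(d18oc, d18osw, param_draws, prior_mean, prior_sd,
                        seed=0):
    """One SST sample per (alpha, beta, tau) posterior draw."""
    rng = random.Random(seed)
    y = d18oc - d18osw
    draws = []
    for alpha, beta, tau in param_draws:
        mean, sd = sst_posterior(y, alpha, beta, tau, prior_mean, prior_sd)
        draws.append(rng.gauss(mean, sd))
    return draws

# Hypothetical parameter draws; real draws would come from MCMC output.
param_draws = [(3.5, -0.23, 0.5), (3.4, -0.22, 0.52), (3.6, -0.24, 0.48)]
sst_draws = sst_posterior_draws(-1.5, 0.5, param_draws,
                                prior_mean=24.9, prior_sd=7.6)
mean_one, sd_one = sst_posterior(-2.0, 3.5, -0.23, 0.5, 24.9, 7.6)
```

With a very wide prior, the posterior mean tends to (y − α)/β, the direct algebraic inversion of the forward model. A tighter prior pulls the estimate toward the prior mean.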
Additional description of priors, hyperparameters, model inversion, and MCMC sampling is given in the appendix. We implemented the Bayesian models for this analysis with the pymc3 library (Salvatier et al., 2016) on an Open Source Python software stack (Hunter, 2007; Hoyer & Hamman, 2017; McKinney, 2010; Met Office, 2010; Oliphant, 2015). Code implementing our analysis is available online (https://github.com/brews/d18oc_sst). The calibration models are available for broader use with the Open Source bayfox software library, described in section 4.7.
3 Results and Discussion
The four Bayesian calibration models differ in whether they account for species-specific differences (“pooled” vs. “hierarchical”) or the seasonality of foraminiferal abundance (“annual” vs. “seasonal”). Our “pooled annual” model combines all five species together and calibrates core top δ18Oc to annual mean SSTs. The “hierarchical annual” model also calibrates to annual mean SSTs but allows the calibration parameters to vary for each species. The “pooled seasonal” model and “hierarchical seasonal” model use the same pooled and hierarchical designs but are calibrated with our seasonal SST estimates, based on sediment trap fluxes (see section 2.3), instead of annual SSTs.
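The practical difference between the pooled and hierarchical designs is which parameters a prediction uses. The sketch below assumes a linear forward relationship between SST and fractionation (δ18Oc − δ18Osw). All intercepts and the species-specific slopes are hypothetical placeholders, not fitted results. The seasonal variants would use the same functions but pass a seasonal SST instead of an annual one.

```python
# Hypothetical parameter values for illustration only; not fitted results.
POOLED = {"alpha": 3.5, "beta": -0.23}
BY_SPECIES = {
    "G. ruber": {"alpha": 3.0, "beta": -0.21},
    "N. pachyderma": {"alpha": 3.6, "beta": -0.20},
}

def predict_fractionation(sst, species=None):
    """Predicted (d18Oc - d18Osw) for a given SST.

    Uses species-specific parameters when available (hierarchical design)
    and falls back to the pooled parameters otherwise.
    """
    params = BY_SPECIES.get(species, POOLED)
    return params["alpha"] + params["beta"] * sst

pooled_pred = predict_fractionation(10.0)
ruber_pred = predict_fractionation(10.0, "G. ruber")
unknown_pred = predict_fractionation(10.0, "extinct species")
```

Falling back to the pooled parameters for an unrecognized species mirrors the intended use of the pooled model for species without a modern calibration.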
3.1 The Core Top δ18Oc Relationship With SST
Within the core top data used for calibration, the relationship between annual SST and core top isotopic fractionation (δ18Oc − δ18Osw) is strongly negative for both the pooled data set (r = −0.97; p ≪ 0.01) and individual species data sets. The relationship sits close to previous calibrations based on inorganic calcite precipitation, live cultures, and plankton tows (Bemis et al., 1998; Kim & O'Neil, 1997; Mulitza et al., 2003a; O'Neil et al., 1969; Figure 4). Despite this, the core top data have notable spread and deviations relative to previous calibrations, which can be understood as the expression of uncertainty related to sedimentological factors (glacial sediment mixing from bioturbation, loss of core top material when coring, and low sedimentation rates), biological factors (seasonal abundance and vital effects), and uncertainties in δ18Osw.



All four Bayesian calibration models reasonably replicate core top data spread when we predict core top fractionation (Figure 5). Calibration model spread is measured as the mean sample standard deviation of the posteriors. The pooled models have larger spread than the hierarchical models (Figure 5; 0.54‰ for the pooled annual model and 0.51‰ for the pooled seasonal model, versus 0.47‰ for the hierarchical annual model and 0.49‰ for the hierarchical seasonal model). These spread values correspond to uncertainties in SST of approximately 2.5–2.8 °C (with β = −0.19) or 1.9–2.1 °C (with β = −0.25), depending on the regression slope (β) of the calibration model. For comparison, the Orbulina universa high-light and low-light culture calibrations of Bemis et al. (1998) have standard errors between 0.10‰ and 0.15‰, which represents 0.5 to 0.7 °C using their calibration β of −0.21. The wider standard errors of the species-specific plankton tow-based calibrations of Mulitza et al. (2003a) range from 0.21‰ to 0.32‰ (0.9 and 1.2 °C at their calibration β = −0.23 and β = −0.21, respectively), likely resulting from the broader range of depth habitats and thermocline structure captured during plankton tow collection. The prediction spread is larger in our calibration models because our compilation of global core top data contains a wider range of uncertainty (e.g., calcification depth, postdepositional bioturbation, and differences in depositional age).
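The conversion from a spread in ‰ to an approximate SST uncertainty is a division by the magnitude of the slope. A minimal sketch, using a hypothetical helper name and values quoted in the paragraph above:

```python
def permil_to_degc(spread_permil, beta):
    """Approximate SST uncertainty (°C) implied by a d18Oc spread (‰).

    beta is the regression slope in ‰ per °C; only its magnitude matters.
    """
    return spread_permil / abs(beta)

# Core top model spread at the low-sensitivity end (beta = -0.19).
print(round(permil_to_degc(0.54, -0.19), 1))  # → 2.8
print(round(permil_to_degc(0.47, -0.19), 1))  # → 2.5

# Culture calibration standard errors at beta = -0.21.
print(round(permil_to_degc(0.10, -0.21), 1))  # → 0.5
print(round(permil_to_degc(0.15, -0.21), 1))  # → 0.7
```

This first-order conversion ignores uncertainty in β itself, which the full Bayesian inversion accounts for.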


The posterior distributions of the slope parameter (the sensitivity between seawater temperature and δ18Oc; β) for the pooled annual and pooled seasonal models likewise fall near the bounds of thermodynamic expectation (Figure 6), with nearly identical mean values of −0.23. In the hierarchical models, β is allowed to deviate between species (βi), resulting in greater variability (Figure 6). In four of the five species, the median value of βi in the hierarchical annual model is lower than thermodynamic expectation (Figure 6). This tendency for lower sensitivity is also apparent in the posterior for the β hyperparameter (the slope parameter shared across all species), which is −0.18 ± 0.05 (95% CI) in the hierarchical annual model. In contrast, the hierarchical seasonal β hyperparameter more closely matches that of equilibrium fractionation (−0.21 ± 0.04, 95% CI). However, species-specific offsets in sensitivity do persist; in particular, βi is lower for N. incompta and T. sacculifer. These differences could simply be related to the narrower range of temperatures used for most of the species-specific calibrations (Bemis et al., 2002); for instance, βi values in the seasonal model are close to the inorganic sensitivity, and G. bulloides, which has a wider temperature range than the other species, has a posterior βi that is very similar to the pooled calibration values (Figure 6).

3.2 Cross-Validation Statistics and Model Comparison
Model cross-validation statistics (LOOCV; Figure 7) suggest that predictive performance of the model improves when we account for seasonality of foraminiferal abundance and species-specific differences. The pooled seasonal model (LOOCV = 2,063 ± 65) outperforms the pooled annual model (LOOCV = 2,248 ± 74), and predictive performance further improves when we incorporate species-specific differences in the hierarchical annual model (LOOCV = 1,783 ± 68). The hierarchical seasonal model shows a decrease in validation performance (LOOCV = 1,952 ± 66) compared to the hierarchical annual model, but this does not necessarily mean that seasonality is unimportant. The lack of improvement partly reflects the fact that under the hierarchical model design, seasonal abundance differences can be subsumed under species-specific differences. This said, patterns in the residuals suggest that data from the Mediterranean region may explain the increased LOOCV in the hierarchical seasonal model (see discussion below).

3.3 Model Residual Trends and Spatial Patterns
Taken as a whole, the residuals of the pooled annual calibration model are well behaved (Figure 8a). However, if we use the pooled annual model to predict δ18Oc for individual species, strong trends emerge. At high SSTs, the model predicts more negative δ18Oc than observed, and at low SSTs, the model predicts more positive δ18Oc than observed, for all species except G. bulloides (Figure 8b). Predictions improve when the pooled seasonal model is used; residuals still retain similar species-specific trends (Figure 9a), but they are less severe, particularly for G. ruber and N. pachyderma. When the hierarchical calibration model is used for species-specific predictions, the trends in residuals are eliminated (Figures 10a and 11a). This, along with the LOOCV statistics, emphasizes that different species in the core top data have distinct δ18Oc responses to temperature, likely due to different lifestyles, depth habitats, and vital effects. Prediction is clearly improved by accounting for these differences.




Model residuals also show spatially coherent patterns. For example, residuals are generally negative for G. ruber and T. sacculifer in the Western Pacific Warm Pool and positive for G. bulloides near the Southern Ocean (e.g., Figure 8b). These patterns are most prominent in the pooled calibration models (Figures 8b and 9b) and in some cases are alleviated by explicitly accounting for seasonality and species-specific offsets in the pooled seasonal and hierarchical seasonal models, respectively (Figure 11). However, some residual structures persist and may reflect biological responses to true geographic differences in secondary environmental parameters, such as gradients in δ18Osw, nutrients, and light penetration.
For example, G. bulloides has a demonstrated preference for nutrient-rich waters with high turbidity, and seasonal abundance of G. bulloides is strongly tied to regional upwelling patterns (Abrantes et al., 2002; Gibson et al., 2016). Thus, G. bulloides δ18Oc may be strongly skewed to record temperatures from cold, upwelled waters. The positive δ18Oc residuals for G. bulloides in the Benguela Current, Peru Current, and other upwelling sites along the Southern Ocean boundary may reflect this habitat preference; these positive residuals persist even when accounting for seasonality and species-specific sensitivity (Figures 8–11). In the southwestern Atlantic Ocean, positive residuals (model predicts warmer temperatures) may reflect a response of G. bulloides to increased turbidity from seasonal river discharge from the Rio de la Plata estuary and/or variations in nutrient delivery related to wind direction and interactions with the Brazil Current (Piola et al., 2005). These factors might skew G. bulloides abundance to cooler seasons and/or a deeper depth habitat.
We also observe persistent positive mean δ18Oc residual values for G. ruber in the Mediterranean region in both the pooled seasonal and hierarchical seasonal models (Figures 9b and 11b). This group of residuals from the Mediterranean substantially contributes to the reduced performance of the seasonal hierarchical model (LOOCV = 1,952 ± 66) relative to the annual hierarchical model (LOOCV = 1,783 ± 68). With Mediterranean core top data excluded, the two hierarchical models perform similarly (1,716 ± 66 and 1,691 ± 67, respectively).
The Mediterranean bias in the seasonal models suggests that either our estimation of seasonality for this region is problematic or that there are other factors influencing foraminiferal calcification that are not accounted for in our basic model setup. Our seasonality estimates predict that G. ruber abundance and shell flux should be skewed toward boreal summer and fall, which agrees well with plankton tow surveys in the Mediterranean (e.g., Pujol & Grazzini, 1995). We therefore favor the second explanation; that is, that there is another mechanism that might explain the Mediterranean model residuals. We can rule out issues related to bioturbation because the majority of the core top samples for these sites have age control with dates placing them in the Holocene or Late Holocene (Sabbatini et al., 2011). Postdepositional overgrowth from the supersaturation of bottom water calcite is known to complicate the use of the foraminiferal Mg/Ca proxy in the Mediterranean (e.g., Kontakiotis et al., 2011) and could feasibly also account for the positive δ18Oc residuals. However, if this were the case, we would expect to see strong positive residuals in other Mediterranean species and not just G. ruber. Another explanation for this apparent bias in the Mediterranean relates to habitat depth. During the summer-fall season, G. ruber expands its depth habitat within the water column and therefore may be recording cooler temperatures. This is supported by substantial abundance of G. ruber observed down to 100 m in the summer-fall season in the Mediterranean (Pujol & Grazzini, 1995). Yet another source of uncertainty is the seasonal changes in δ18Osw and salinity (of 1 PSU) observed in the Mediterranean (e.g., MEDAR Group, 2002), which are not captured in the LeGrande and Schmidt (2006) δ18Osw data.
We also observe a recurring pattern of positive δ18Oc residuals (warmer predicted temperatures) near major oceanic frontal boundaries. This occurs for G. bulloides along the Southern Ocean front, for N. incompta near the confluence of the Agulhas and Benguela Currents, and for N. pachyderma near the boundary between the Labrador Current and Gulf Stream in the North Atlantic (Figure 11b). Such residual patterns could reflect advection of foraminifera shells which calcified along the colder side of these boundaries (Martínez-Méndez et al., 2010). However, sediment trap studies suggest that the rapid settling rate of foraminifera shells is unlikely to result in a strong advection bias (e.g., King & Howard, 2005), and differences in habitat related to water masses may be strongly dependent on species (e.g., Dyez et al., 2014). An alternate explanation is that sharp hydrological gradients in these frontal regions may bias the estimation of δ18Osw from limited measurements (LeGrande & Schmidt, 2006; Waelbroeck et al., 2005). With relatively coarse 1° × 1° fields, SST and especially δ18Osw in boundary regions between water masses are likely poorly characterized.
A related issue is that assumptions about local δ18Osw-salinity relationships may introduce a region-specific uncertainty in δ18Osw, which is translated into highly coherent spatial patterns in δ18Oc residuals. For example, G. bulloides, G. ruber, and T. sacculifer all have negative δ18Oc mean residuals (generally colder predicted temperatures or higher predicted δ18Oc than observed) in the eastern boundary upwelling regions of the Atlantic Ocean (Figure 11b). These trends are present in both the annual and seasonal versions of the hierarchical model. It is unlikely that these negative residuals represent a biological bias. Sediment trap studies off the coast of Saharan Africa demonstrate that G. ruber is more abundant during the winter months (Abrantes et al., 2002), yet the negative model residuals in δ18Oc suggest that G. ruber is calcifying in warmer SSTs than predicted by the model. Here, a seasonal bias in the abundance of warm-water species is clearly not the cause of the geographic bias in model residuals, and a more likely candidate can be found in the predictor variables.
3.4 Summary of Calibration Model Performance
All four of our calibration models replicate the center and spread in the core top record (with model spread from 0.54‰ to 0.47‰) while reasonably reproducing the temperature-δ18Oc relationship given in established equilibrium calcite relationships (e.g., Kim & O'Neil, 1997; O'Neil et al., 1969), as well as relationships determined for foraminiferal calcite from live culture and plankton tows (e.g., Bemis et al., 1998; Mulitza et al., 2003b). There are species-specific trends in the residuals for both pooled models; these are eliminated in the hierarchical models as the latter allow the regression parameters to vary by species. All models show some residual bias along oceanic fronts and upwelling zones, where dynamic hydrography may introduce complex seasonal patterns in abundance and habitat depth. Overall, model performance improves when accounting for foraminiferal seasonality and species-specific variability, with the exception of the hierarchical seasonal model versus the hierarchical annual model. As discussed, the reduced performance in the hierarchical seasonal model is related to the unusual behavior of G. ruber data in the Mediterranean, which may reflect depth habitat migration. Even though the hierarchical seasonal model objectively performs worse by the LOOCV metric, it produces posterior distributions of temperature sensitivity (βi) in closer agreement with thermodynamic expectations (Figure 6). As we discuss below, the pooled annual model is a more appropriate choice for applications to extinct planktic species in the geologic past.
4 Examples and Applications
A key benefit of our Bayesian core top calibrations is that the models can propagate uncertainty from calibration into predictions about past climate conditions. We demonstrate this using several downcore examples from different oceanographic settings. For each example, we apply our calibration models to foraminiferal δ18Oc and compare our results with SSTs inferred from independent organic geochemistry records from the same core sites (either UK′37 or TEX86 data) or sediment trap measurements. Inference of SST from δ18Oc requires priors on SST means and standard deviations. We take these from Table 1, multiplying the standard deviation by 2. The δ18Oc data are ice volume corrected before calibration to remove the changes in global δ18Osw associated with ice sheets (see Appendix B). In all cases we use modern, annual δ18Osw estimates from LeGrande and Schmidt (2006), unless noted otherwise.
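As a minimal sketch of how such a prior enters the inversion, the Python snippet below treats the calibration as a single linear relationship, δ18Oc = α + β·SST + (δ18Osw − 0.27) + noise, with fixed parameters, and combines it with a Gaussian SST prior through the standard conjugate-normal update. The values of α, β, and the residual standard deviation are illustrative assumptions, as are the function and argument names; the 0.27‰ VSMOW-to-VPDB offset is a commonly used conversion that is assumed here. The calibration models themselves draw parameters from a full posterior rather than fixing them.

```python
import numpy as np

# Illustrative, hypothetical calibration parameters (not the fitted posterior).
ALPHA = 3.0    # intercept, per mil
BETA = -0.23   # temperature sensitivity, per mil per degC
TAU = 0.5      # residual standard deviation, per mil


def invert_d18oc(d18oc, d18osw, prior_mean, prior_std):
    """Posterior mean and std of SST for one d18Oc value (conjugate-normal update)."""
    # Isolate the temperature-dependent part: y = BETA * SST + noise.
    y = d18oc - ALPHA - (d18osw - 0.27)
    prior_prec = 1.0 / prior_std**2
    like_prec = BETA**2 / TAU**2
    post_var = 1.0 / (prior_prec + like_prec)
    post_mean = post_var * (prior_prec * prior_mean + BETA * y / TAU**2)
    return post_mean, np.sqrt(post_var)


# Same observation under a tight prior and under a prior with doubled std.
tight = invert_d18oc(-1.0, 0.24, prior_mean=25.0, prior_std=2.0)
wide = invert_d18oc(-1.0, 0.24, prior_mean=25.0, prior_std=4.0)
print(f"tight prior: {tight[0]:.1f} +/- {tight[1]:.1f} degC")
print(f"wide prior:  {wide[0]:.1f} +/- {wide[1]:.1f} degC")
```

In this simplified setting, widening the prior standard deviation weakens the pull of the posterior toward the prior mean and lets the δ18Oc observation carry more weight.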
4.1 G. bulloides in the Mediterranean: A Cosmopolitan Foraminifera Species
G. bulloides has a wide temperature tolerance (2.8–29.6 °C; Bé, 1977; Hemleben et al., 1989; Figure 2), so the seasonal and annual models perform effectively the same for this species; this is evident in the similar posterior distributions of β and βi (Figure 6). To investigate model performance on G. bulloides, we used our hierarchical annual calibration, though results are nearly identical for the seasonal calibration. As a test case, we applied the calibration to a G. bulloides δ18Oc time series from core MD95-2043, spanning from 52 ka to present in the eastern Alboran Sea in the Mediterranean (Cacho et al., 1999). We used a modern δ18Osw value of 1.05‰ (VSMOW; LeGrande & Schmidt, 2006). Resulting SSTs are compared to an alkenone UK′37 reconstruction from the same site, calibrated with BAYSPLINE (Tierney & Tingley, 2018).
Our model-based G. bulloides reconstruction matches the mean and variability of the alkenone-based reconstruction remarkably well (Figure 12a), though the alkenone posterior is noticeably tighter than that for G. bulloides. However, the BAYSPLINE calibration explicitly calibrates alkenone data in the Mediterranean to November–May SSTs (Tierney & Tingley, 2018), and indeed the late Holocene values approach the modern November–May value of 16.5 °C. The median predicted values from G. bulloides are cooler than the present-day annual range and generally follow the alkenones. This may reflect a slight cool-seasonal bias for G. bulloides, which has peak abundance in the winter and spring in the Western Mediterranean (Bárcena et al., 2004; Rigual-Hernández et al., 2012). However, the uncertainty bounds are large and the latest Holocene estimates do encompass modern mean annual SSTs (18.8 °C; Boyer et al., 2013). Within calibration uncertainty, the model reconstruction using G. bulloides reasonably estimates annual SST.


4.2 High-Latitude Settings: Cool-Temperature Foraminifera
To investigate the performance of our model in a cold-water setting (Figure 12b), we apply our annual hierarchical calibration to a deglacial-to-Holocene N. pachyderma δ18Oc record from core EW0408-85JC in the Gulf of Alaska (Davies et al., 2011; Davies-Walczak et al., 2014; Praetorius et al., 2015). The reconstruction uses the modern δ18Osw estimate for the site (VSMOW; LeGrande & Schmidt, 2006). The results are similar if we use the seasonal hierarchical model (not shown). The N. pachyderma record shows similar changes through time as UK′37 but is offset from the latter by approximately 2–3 °C. At this location, the UK′37 calibration BAYSPLINE assumes a June–August bias and explicitly predicts summer temperatures; thus, the warm offset is expected and primarily reflects differences in the seasonal production. The N. pachyderma reconstruction is slightly cooler on average than the modern annual SST range. N. pachyderma occupies a broad depth habitat range in the North Pacific (Kuroyanagi et al., 2011), and so the slightly cooler SSTs predicted may reflect the offset between sea surface and actual N. pachyderma habitat depth, which we do not explicitly account for with our calibrations. Additionally, N. pachyderma tends to have peak abundance in the spring and late winter in the Gulf of Alaska (Sautter & Thunell, 1989), which has modern SSTs closer to the late-Holocene predictions. Regardless, the reconstructed Holocene SSTs fall within uncertainty of present-day mean annual values, suggesting that our model is reasonably accurate at estimating annual SST from N. pachyderma δ18Oc at this location.
4.3 Annual Versus Seasonal Calibration
To explore the impact of accounting for seasonality, we apply both the annual and seasonal hierarchical models to a G. ruber record from core VM21-30 in the eastern equatorial Pacific spanning the last 30 ka (Koutavas & Sachs, 2008; Koutavas & Joanides, 2012). At this site, the effect of model selection is notable, with the annual model resulting in overall cooler temperatures and a larger range of variability (Figure 12c) than the seasonal model (Figure 12d). In both cases we used a modern δ18Osw of 0.24‰ (VSMOW; LeGrande & Schmidt, 2006). The annual calibration and an alkenone-based reconstruction have similar uncertainties (Figure 12c). The seasonal calibration results in temperatures that agree more closely with the alkenone-inferred SSTs, both in the mean and magnitude of reconstructed trends (Figure 12d), and produces a tighter reconstruction. Observations of G. ruber in the tropical Pacific show peak abundance for warm SSTs (Thunell et al., 1983), and our flux-based estimates suggest that G. ruber at this site should be seasonally biased toward December through May, the time of year when SSTs are at their warmest (Fiedler & Talley, 2006). The fact that the alkenone reconstruction best matches our seasonal G. ruber predictions (Figure 12d) suggests that the UK′37 record could also be biased toward warmer months rather than recording annual SSTs. However, there is no indication of warm bias in the eastern equatorial Pacific core top alkenone data (Kienast et al., 2012; Tierney & Tingley, 2018), and the UK′37 predictions during the Holocene still overlap with the range of annual SSTs at this site.
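The difference between the two calibration targets can be illustrated with a short Python sketch that averages a monthly SST climatology over all months and over the December through May window mentioned above. The monthly SST values and the helper name are hypothetical placeholders, not data from the VM21-30 site, and the sketch only shows how a seasonal window changes the temperature that the proxy is calibrated against.

```python
import numpy as np

# Hypothetical monthly SST climatology (degC), January through December.
monthly_sst = np.array([24.5, 25.8, 26.4, 25.9, 24.8, 23.6,
                        22.7, 22.0, 21.9, 22.3, 22.9, 23.6])


def seasonal_mean(sst, months):
    """Mean SST over the given calendar months (1 = January)."""
    idx = np.asarray(months) - 1  # convert calendar months to array indices
    return float(np.mean(sst[idx]))


annual = seasonal_mean(monthly_sst, range(1, 13))
dec_to_may = seasonal_mean(monthly_sst, [12, 1, 2, 3, 4, 5])
print(f"annual mean: {annual:.2f} degC, Dec-May mean: {dec_to_may:.2f} degC")
```

With a warm-season window, the calibration target is warmer than the annual mean, so the same δ18Oc value maps to a warmer reconstructed SST under a seasonal calibration than under an annual one.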
4.4 Influence of Freshwater Input and Changes in δ18Osw
To see the impact of freshwater inputs on a δ18Oc-based temperature reconstruction, we apply our calibration to a G. ruber record from core GeoB 6518-1 (Schefuß et al., 2005) on the west coast of Africa in the Gulf of Guinea, near the mouth of the Congo River (Figure 13a). The core spans the last deglaciation (20 ka to present), which saw dramatic changes to the central African hydroclimate (Gasse, 2000; Schefuß et al., 2005). Alkenone UK′37 data, which are not affected by changing δ18Osw or freshwater input, are available from this site for comparison with δ18Oc (Schefuß et al., 2005). As in earlier examples, we apply our hierarchical seasonal calibration to the G. ruber δ18Oc record and use BAYSPLINE (Tierney & Tingley, 2018) to reconstruct temperature from the UK′37 record. The estimated modern δ18Osw for the site is 0.52‰ (VSMOW; LeGrande & Schmidt, 2006), and our estimated peak seasonal growth for G. ruber at this site is from September to June. The reconstructions from δ18Oc and UK′37 are relatively similar in the Late Holocene, the Younger Dryas (∼12 ka), and the Last Glacial Maximum. Outside of these periods, the foraminifera-based reconstruction diverges from the alkenone-based reconstruction, predicting warmer temperatures. These periods (the Early Holocene and Bølling-Allerød) correspond to times of larger freshwater inputs from increased precipitation across the extensive Congo River basin (Schefuß et al., 2005). This application demonstrates that, in coastal regions with large freshwater inputs, δ18Oc will be biased toward more negative values, even when our new calibration models are applied.







4.5 Gulf of Mexico Sediment Traps: Predicting Monthly δ18Oc of G. ruber
In this example we test how well our calibration can replicate the seasonality of monthly δ18Oc measured from G. ruber (white and pink) at a sediment trap site, where SSTs and δ18Osw values are well constrained (Figure 13b). The site is in the northern Gulf of Mexico and has repeated δ18Oc measurements for each month from 2010 to 2013 (Richey et al., 2019). These years are pooled into monthly mean δ18Oc values. We use the hierarchical annual calibration and not the seasonal variant because the seasonal calibrations favor SSTs associated with high foraminifera abundance and so tend to be better at predicting peak abundance seasons than growth for each month of the year. We use HadISST (Rayner et al., 2003) monthly mean SSTs from the nearest grid point to the sediment trap site, as Richey et al. (2019) found that HadISST reasonably replicated the seasonality of SSTs when compared with measurements from a local buoy. We create two predictions, one each with δ18Osw values of 1.40‰ (VSMOW) and 0.86‰ (VSMOW), following the measured mean δ18Osw of 1.13‰ and annual range of 0.53‰ between 0 and 50 m depth (Richey et al., 2019). We lag our calibration predictions by 1 month to account for the time needed for foraminiferal tests to settle in the water column. Our calibration replicates the observed δ18Oc seasonal pattern remarkably well (Figure 13b). For the closer of the two predictions, the absolute difference between monthly prediction and observation means is between 0.02‰ and 0.24‰ for G. ruber (pink) and between 0.03‰ and 0.66‰ for G. ruber (white). This is an encouraging result given the spread of the full prediction distribution and the additional spread in the sediment trap observations, which can be as high as σ = 0.47‰ (see Figure 13b).
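The forward prediction and 1-month lag described in this example can be sketched in Python as follows. This is a simplified illustration with fixed, hypothetical calibration parameters and made-up monthly SSTs; only the two δ18Osw values (1.40‰ and 0.86‰) come from the text, and the 0.27‰ VSMOW-to-VPDB offset is an assumed standard conversion.

```python
import numpy as np

ALPHA, BETA = 3.0, -0.23  # hypothetical calibration parameters


def predict_d18oc(sst, d18osw):
    """Forward model: expected d18Oc (per mil VPDB) from SST and d18Osw (per mil VSMOW)."""
    return ALPHA + BETA * np.asarray(sst) + (d18osw - 0.27)


# Hypothetical monthly mean SSTs (degC), January through December.
sst = np.array([20.5, 20.3, 21.0, 23.0, 26.0, 28.5,
                29.8, 30.2, 29.3, 27.0, 24.2, 21.8])

# np.roll(x, 1) shifts the series forward by one month, so the value compared
# with the trap in month m is the prediction from month m - 1 (January wraps
# around to use December).
lagged = {d18osw: np.roll(predict_d18oc(sst, d18osw), 1)
          for d18osw in (1.40, 0.86)}

for d18osw, series in lagged.items():
    print(f"d18Osw = {d18osw:.2f}: " + " ".join(f"{v:.2f}" for v in series))
```

Because δ18Osw enters the forward model additively, the two scenarios produce seasonal curves of identical shape that are offset from each other by the difference in δ18Osw.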
4.6 Application to Deep-Time Paleoclimate Reconstructions
Data from Late Quaternary sediment cores benefit from application of our hierarchical models, but these models cannot be reliably applied to non-extant planktic species in deeper geological time. Here, our annual pooled model is a more appropriate choice, and this was one of our primary motivations for developing this calibration model. In the pooled model, all species are assumed to calcify similarly, which enables us to approximate a general “planktic dependency” of δ18Oc on annual SSTs. Arguably, this is the best first-order approach for applications to non-extant species, for which species-specific information such as seasonality, depth habitat, presence of symbionts, and crust formation is poorly constrained.
As a demonstration, we apply our pooled model to a δ18Oc record of the planktic Morozovella spp. from the Paleocene-Eocene Thermal Maximum marine section from Bass River, New Jersey (John et al., 2008) (Figure 13). These specimens are “glassy” (well preserved) and thus are unlikely to show any of the isotopic overprinting common in “frosty” deep-sea specimens from this period, which can generate anomalously cold SST reconstructions from simplistic interpretation of δ18Oc measurements (Kozdon et al., 2011). We show reconstructions for δ18Osw values that are plausible for this site (John et al., 2008) and that bound the modeled value of 0.0‰ from Tindall et al. (2010). We compare the results to a TEX86-based SST reconstruction from the same section (Sluijs et al., 2007). The TEX86 data are calibrated to SSTs using the BAYSPAR analogue method as described in Tierney and Tingley (2014). We use a search tolerance of 0.15, resulting in 11 modern analog grid points. We use a wide, weakly informative prior for SST (Gaussian, 30.0 °C mean and 20.0 °C standard deviation) for both the TEX86 and foraminiferal data.
The inferred SSTs from the Morozovella spp. and TEX86 data generally agree with one another and overlap within uncertainties, though there is increased separation just below 355 m, near the onset of the Paleocene-Eocene Thermal Maximum (Figure 13c). It is tempting to interpret this as temporary differentiation of depth habitat (e.g., Morozovella spp. migrating deeper in the water column and/or TEX86 producers forced to the surface), but the large uncertainties on both SST reconstructions must also be considered. We note that the average standard deviation of the foraminiferal reconstruction is much smaller than the uncertainty on the TEX86 estimates. This can be attributed to the fact that the TEX86 values for this sequence (which approach 0.9) are far outside the range of the calibration data set, which contains only a few values above 0.75, so that heavy extrapolation is required. Another important factor is that overall, the temperature sensitivity of TEX86 is poorly constrained relative to β in our δ18Oc calibrations. Indeed, β is well estimated by our pooled model and remarkably similar to thermodynamic expectation (annual pooled β posterior 95% CI is between −0.234 and −0.228) for modern species. That said, the sensitivity of the δ18Oc proxy to δ18Osw must be considered, where a difference of 1.0‰ corresponds to approximately 4.1 °C (Figure 13c). In an unglaciated Eocene world at 55 Ma, global δ18Osw can be fairly well estimated, but poorly constrained local variations in δ18Osw are another potential source of uncertainty in interpreting δ18Oc. This deep-time example underlines the importance of constraining variability and quantifying uncertainty in δ18Osw when using δ18Oc to reconstruct SST.
Overall, these examples show that our calibration models produce reasonable results when applied to foraminiferal records of δ18Oc. Our SST reconstructions compare favorably with independent SST reconstructions using biomarker-based proxies in a variety of paleoceanographic settings and using different planktic species. We are also able to predict subannual δ18Oc seasonality found in sediment trap measurements. Importantly, these examples demonstrate how our models incorporate multiple sources of uncertainty in reconstructing SST, which is often not reported or considered in published downcore records and subsequent interpretations. This fundamental uncertainty in SST reconstructions, which is inherent in any paleo-proxy, needs to be considered when comparing records, integrating SST reconstructions across geographic regions, or comparing data to climate model output.
4.7 bayfox: Bayesian Foraminifera Calibration Software
Our calibration models are available to users as a software library called bayfox. This library is packaged and available in both Python (https://github.com/brews/bayfox) and R (https://github.com/brews/bayfoxr). Scripts are also available for MATLAB/Octave (https://github.com/brews/bayfoxm). These packages include both forward and reverse calibration models so that users can infer δ18Oc from SSTs and infer SSTs from δ18Oc. The software is available under an Open Source license.
5 Conclusions
Our Bayesian calibration models enhance the widespread use of planktic δ18Oc in paleoceanography by providing a realistic representation of proxy uncertainty based on core top variability. We find that, in spite of the many biological and environmental factors that can influence planktic δ18Oc, the inferred sensitivity to SST is remarkably similar to established inorganic calcite calibration curves, attesting to the fidelity of δ18Oc. Our annual, seasonal, and species-specific calibration exercises demonstrate that model performance is improved by accounting for foraminiferal seasonal abundance and species-specific variability in calibration parameters. However, some residual patterns remain and can be difficult to diagnose due to the complex set of environmental factors that control the abundance of each foraminiferal species, as well as uncertainty in observed variables, specifically δ18Osw. We demonstrate how the calibration can be used to reconstruct SST in the Late Quaternary, where generally speaking, the most applicable model is the hierarchical seasonal model. We demonstrate how the calibration can replicate the seasonal signal of δ18Oc observed in Gulf of Mexico sediment traps. We also demonstrate how our pooled annual model can be used to infer SSTs from δ18Oc of non-extant species of planktic foraminifera in deeper geological time. We have made the calibration models available in Open Source software libraries (bayfox), so that users can apply these calibrations to both forward and inverse δ18Oc modeling problems.
Acknowledgments
This research was funded by the Heising-Simons Foundation (2016-015) and the National Science Foundation (AGS-1602156). We thank our Editors and anonymous reviewers for their time and thoughtful comments. Thanks to Kaustubh Thirumalai for help collecting sediment trap data. Core top data used for this analysis are available as supporting information. Open Source Software packages implementing these calibrations are available for Python (https://github.com/brews/bayfox) and the R statistical environment (https://github.com/brews/bayfoxr). Scripts are also available for MATLAB/Octave (https://github.com/brews/bayfoxm).
Appendix A: Bayesian Regression Model Design and Priors

The normal priors around α and β loosely reflect existing inorganic precipitation and culture-derived calibrations from the literature (see Bemis et al., 1998). However, we found that less informed α and β prior distributions produced comparable results.
With this design, updates to individual species parameters also inform shared hyperparameters. At the same time, hyperparameters influence the individual species parameters.
We infer the posterior distribution of the models with a No-U-Turn Sampler (Hoffman & Gelman, 2014), an MCMC sampler variant using Hamiltonian mechanics. We initialized the No-U-Turn Sampler using an identity mass matrix with a diagonal adapted to the variance of the sampler tuning steps. The start of each sampling chain was the prior mean with added uniform noise (between −1 and 1). The sampler was run in two chains, each with 1,000 tuning draws followed by 5,000 draws. We compared the chains for convergence and autocorrelation. Tuning steps were removed from the draws, and the chains were combined for analysis.
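The post-processing steps described above (dropping tuning draws, comparing chains, and pooling them) can be sketched in Python. The text does not state which convergence diagnostic was used, so the Gelman-Rubin potential scale reduction factor below is an assumption chosen as one common option, and the chains are synthetic random draws standing in for real sampler output.

```python
import numpy as np


def gelman_rubin(chains):
    """Potential scale reduction factor for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    between = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance B
    within = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance W
    pooled_var = (n - 1) / n * within + between / n
    return float(np.sqrt(pooled_var / within))


rng = np.random.default_rng(0)
n_tune, n_draws = 1000, 5000

# Synthetic stand-in for sampler output: two chains for a single parameter.
raw = rng.normal(loc=-0.23, scale=0.002, size=(2, n_tune + n_draws))

kept = raw[:, n_tune:]        # remove the tuning draws from each chain
rhat = gelman_rubin(kept)     # values near 1 suggest the chains agree
combined = kept.reshape(-1)   # pool both chains for analysis

print(f"R-hat: {rhat:.4f}, pooled draws: {combined.size}")
```

Values of this statistic close to 1 indicate that the between-chain and within-chain variances are similar, which is the usual signal that independently started chains have reached the same distribution.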




Appendix B: Ice Volume Correction









Software implementing this ice volume correction is available for broader use in the Open Source Python package erebusfall (https://github.com/brews/erebusfall).
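As a rough illustration of what an ice volume correction does, and not a description of the erebusfall implementation, the Python sketch below scales a sea-level history linearly so that a full glacial lowstand corresponds to a fixed global δ18Osw enrichment, and then subtracts that enrichment from each sample. The sea-level values, the scaling of 1.0‰ per 120 m of sea-level lowering, and all names are assumptions made for this example.

```python
import numpy as np

LGM_ENRICHMENT = 1.0     # assumed glacial d18Osw enrichment, per mil
LGM_SEA_LEVEL = -120.0   # assumed glacial sea-level lowstand, m


def ice_volume_correct(age_ka, d18o, sl_age_ka, sea_level_m):
    """Remove a sea-level-scaled global d18Osw signal from a d18O series."""
    # Interpolate the sea-level curve onto the sample ages
    # (sl_age_ka must be increasing for np.interp).
    sea_level = np.interp(age_ka, sl_age_ka, sea_level_m)
    ice_effect = LGM_ENRICHMENT * sea_level / LGM_SEA_LEVEL
    return np.asarray(d18o, dtype=float) - ice_effect


# Hypothetical sea-level curve (m relative to present) and samples.
sl_age = np.array([0.0, 10.0, 20.0])
sl = np.array([0.0, -40.0, -120.0])
corrected = ice_volume_correct([0.0, 15.0, 20.0], [-1.5, -0.6, 0.1], sl_age, sl)
print(corrected)
```

Under these assumptions, modern samples are unchanged, while glacial-age samples are shifted toward more negative values before the temperature calibration is applied.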