Volume 15, Issue 12 e2023MS003899
Research Article
Open Access

A Variational LSTM Emulator of Sea Level Contribution From the Antarctic Ice Sheet

Peter Van Katwyk (Corresponding Author)

Department of Earth, Environmental, and Planetary Sciences, Brown University, Providence, RI, USA

Data Science Institute, Brown University, Providence, RI, USA

Institute at Brown for Environment and Society, Brown University, Providence, RI, USA

Correspondence to: P. Van Katwyk, [email protected]

Contribution: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing - original draft, Writing - review & editing, Visualization, Project administration, Funding acquisition

Baylor Fox-Kemper

Department of Earth, Environmental, and Planetary Sciences, Brown University, Providence, RI, USA

Institute at Brown for Environment and Society, Brown University, Providence, RI, USA

Contribution: Conceptualization, Methodology, Validation, Investigation, Resources, Writing - original draft, Writing - review & editing, Supervision

Hélène Seroussi

Thayer School of Engineering, Dartmouth College, Hanover, NH, USA

Contribution: Validation, Resources, Data curation, Writing - review & editing

Sophie Nowicki

Department of Geology, University at Buffalo, Buffalo, NY, USA

Contribution: Validation, Resources, Data curation, Writing - review & editing

Karianne J. Bergen

Department of Earth, Environmental, and Planetary Sciences, Brown University, Providence, RI, USA

Data Science Institute, Brown University, Providence, RI, USA

Contribution: Conceptualization, Methodology, Software, Validation, Investigation, Data curation, Writing - original draft, Writing - review & editing, Visualization, Supervision, Project administration

First published: 23 December 2023

Abstract

The Antarctic ice sheet (AIS) will be a dominant contributor to global mean sea level rise in the 21st century but remains a major source of uncertainty. The Ice Sheet Model Intercomparison for CMIP6 (ISMIP6) is an ensemble of continental-scale models for studying the evolution of the AIS and projecting its future contribution to sea level. Due to their complexity and computational cost, ISMIP6 simulations are sparse and generated infrequently. Emulators are smaller-scale models that approximate ISMs and enable experimentation and exploration into the drivers of sea level change. We introduce a neural network (NN) emulator to approximate the ISMIP6 ensemble, using a variational Long Short-Term Memory (LSTM) with Monte Carlo dropout to quantify single-projection uncertainty. The proposed NN emulator is compared to a Gaussian Process (GP) emulator on four criteria: accuracy of point estimates and predictive distributions of individual model projections, approximation of the ensemble projections, and model training time. The NN predicts more accurately on single projections, with a mean absolute error of 0.46 mm Sea Level Equivalent (SLE) versus 0.73 mm SLE for the GP, and has more accurate uncertainty estimates. The NN emulator also better approximates the ensemble distribution of ISMIP6 model projections, with a Kullback-Leibler divergence of 18.26 versus 199.14 for GP at the projection year 2100. The NN enables more accurate experimentation with a reduced runtime, offering a new tool for understanding the important role of regional precipitation, ice sheet drainage systems, and interannual and longer timescale dynamics.

Key Points

  • A neural network (NN)-based emulator of Antarctic ice sheet mass loss over the 21st century is introduced

  • The proposed emulator is fast to train, handles large inputs, models temporal correlations and quantifies uncertainty

  • The proposed NN emulator outperforms a standard Gaussian Process emulator on a range of metrics, including computational cost and accuracy

Plain Language Summary

An emulator is a computational model that is designed to rapidly approximate another more complex, computationally intensive model. In this work, we propose a new emulator of the Antarctic ice sheet that provides faster and more accurate projections of the future sea level rise due to the melting of the ice sheet. This new emulator uses a technology called neural networks (NNs), a common tool within the field of artificial intelligence. We show that the NN-based emulator produces accurate sea level projections while also being able to quantify the inherent uncertainty associated with future sea level rise. With these improvements, this emulator will enable scientists to better investigate the ice sheet itself and learn how it connects to Earth's dynamic behavior. In addition, due to the efficiency of the proposed emulator, not only are many avenues of research opened for further improvements of ice sheet emulators, but also a wider range of scientists will be able to contribute to future development. This will lead to a better understanding of the drivers of sea level rise as well as more informed decision-making and mitigation efforts.

1 Introduction

Global sea level is projected to rise up to 1 m by 2100, and a contribution approaching 2 m cannot be ruled out due to deep uncertainty in ice-sheet processes (Fox-Kemper et al., 2021). Mass loss of the Antarctic ice sheet (AIS), which was a dominant contributor to global mean sea level rise during the last two decades, constitutes the largest uncertainty in future sea level projections, particularly beyond 2050 (Kopp et al., 2017, 2023). To address this prevailing uncertainty, the Ice Sheet Model Intercomparison Project for CMIP6 (ISMIP6), an intercomparison of ice sheet models (ISMs) endorsed by the Coupled Model Intercomparison Project Phase 6 (CMIP6) and driven with the output of those global climate models, attempts to simulate the evolution of polar ice sheets in response to the Earth's dynamic, inter-connected climate systems (Nowicki et al., 2016). However, due to the complexity of the models and the computational resources and runtimes required, the ISM simulations are generated infrequently. As a result, sensitivity testing (Aschwanden & Brinkerhoff, 2022; Garbe et al., 2020; Golledge et al., 2017), a crucial tool for understanding the relationship between ice sheet conditions (i.e., climate system forcing) and sea level contribution, becomes impractical. Furthermore, numerous scenarios needed for the latest IPCC assessment were not simulated in the first round of ISMIP6 projections, so emulators were used to find the expected sensitivity of the ice sheets to different emissions (Edwards et al., 2021; Fox-Kemper et al., 2021).

Emulators are smaller-scale models that are designed to approximate the mapping of inputs to outputs of more complex models (Reichstein et al., 2019). They are starting to be used to approximate ISMs because they enable experimentation and increase accessibility of the large-scale models (Edwards et al., 2021). Emulators can be based on physical equations by focusing on a subset of components of the larger system (Nonnenmacher & Greenberg, 2021); other emulators are statistical or data-driven models that can produce simulations without access to a mechanistic model of the larger system. While emulation is common in other areas of climate science, such as weather (Chantry et al., 2021), ocean dynamics (Rueda et al., 2016), atmospheric sciences (Ukkonen, 2022), as well as many other applications within the Earth sciences (Camps-Valls et al., 2016; Reichstein et al., 2019), emulating ice sheet dynamics has not been as well explored.

Previous work has mostly focused on modeling the effect of individual forcings or small subsets of forcings (variables affecting Earth's climate) on sea level contribution. Levermann et al. (2020) use empirical modeling to characterize the uncertainty in future ice loss from Antarctica induced by basal ice shelf melting alone. The Model for the Assessment of Greenhouse-gas Induced Climate Change version 2.0 (MAGICC v2.0) (Nauels et al., 2017), a process-based sea level emulator, uses physical modeling with a larger set of inputs (e.g., thermal expansion, surface mass balance, solid ice discharge, land water storage) to assess both the effect of ice sheets on sea level rise and the mass loss from glaciers. Edwards et al. (2021) present the first statistical emulator, using Gaussian Process (GP) regression, of AIS-induced sea level rise (as well as that from the Greenland ice sheet and other mountain glaciers) based on global mean surface temperature, an ice-collapse parameter, and a sub-shelf basal melt parameter. The Edwards et al. (2021) and Levermann et al. (2020) emulators were used extensively in the latest IPCC cryosphere and sea level change assessment, both for supplementing missing ISMIP6 experiments and for providing independent projections based on ice sheet forcing data (Fox-Kemper et al., 2021).

GP regressors (Rasmussen et al., 2006) have been a frequent choice of Earth emulator architecture because of their ability to quantify uncertainty (Camps-Valls et al., 2019). However, the time required to train a GP naively scales cubically with the number of samples drawn from the emulated system, limiting the ability to model high-dimensional data sets with many samples. Following emulator practice, these samples will be called observations, although they are in fact drawn from the physical models being emulated rather than from real-world observations. Climate simulation data, and ISM data in particular, are inherently high-dimensional and contain a large number of observations, so a GP requires severe restriction of the input data set size to be practical. The number of observations is typically reduced by aggregating observations to decrease resolution. GP models are also often trained on fewer input parameters; however, the dimensionality of the input data adds less of a computational constraint on model training than the number of observations does (Liu et al., 2020). GP models are effective in modeling the relationship between a subset of input forcings and sea level contribution, but because of data set restriction, there may be forcings or interactions between forcings that are overlooked. Therefore, without the full range of available forcings, the accuracy of the emulator decreases and the sensitivity tests performed are less reliable.

Neural networks (NN) are increasingly becoming an emulator of choice due to their ability to efficiently model large, high-dimensional data sets (Erfani et al., 2016). A drawback is that NNs typically output a point estimate and do not estimate prediction uncertainty. Bayesian deep learning provides tools for uncertainty quantification that can be integrated into NN architectures (Gal & Ghahramani, 2016a).

In this study, we propose a variational NN-based emulator that models sea level contribution induced by the evolution of the AIS and uses Monte Carlo dropout (MCD) to quantify uncertainty. The goal of this emulator is to accurately approximate the ISMIP6 ensemble, which will enable efficient experimentation and sensitivity testing with ISMs while also enabling accurate and fast supplementation of missing ISMIP6 experiments. Our results show that, compared with a GP emulator, the proposed emulator predicts individual ISMIP6 85-year sea level projections more accurately, while also approximating the distribution of the entire ISMIP6 ensemble more closely. Additionally, the proposed emulator maintains the ability to quantify uncertainty by producing predictive distributions rather than point estimates. Using a NN also enables a dramatic speed increase during training as well as increased portability for future use.

2 Data

2.1 ISMIP6 Data

We use data from the Ice Sheet Model Intercomparison Project for CMIP6 (ISMIP6), which includes the collaboration of 13 modeling groups and 21 submitted sets of ice sheet simulations (Seroussi et al., 2020). ISMIP6 contains forecasts for both the AIS and the Greenland Ice Sheet (Goelzer et al., 2020; Payne et al., 2021), but only the AIS is considered in this study. Following ISMIP6, six CMIP5 Atmosphere-Ocean General Circulation Models (AOGCMs) (CCSM4, MIROC-ESM-CHEM, NorESM1-M, CSIRO-Mk3-6-0, HadGEM2-ES, and IPSL-CM5A-MR) under two Representative Concentration Pathways (RCP 2.6 and RCP 8.5) are included. These models were chosen based on their ability to accurately capture present-day conditions close to the AIS and to provide diversity in projected changes (Barthel et al., 2020; Nowicki et al., 2020). CMIP6 model outputs were not available at the time the experimental protocol was designed, so CMIP5 model outputs were used instead. Output forecasts from 2016 to 2100 from the six AOGCMs are used as inputs to the large-scale ISMs included in ISMIP6. These AOGCMs provide insights into climate systems based on the physical, chemical, and biological properties of individual climate components as well as their interactions. For more information on the ISMs, configurations, AOGCMs, and simulations, see Seroussi et al. (2020).

The AOGCM outputs are used to provide atmospheric, oceanic, and ice shelf collapse forcings. Atmospheric forcing includes yearly averaged surface mass balance anomalies (combining precipitation, evaporation, sublimation, and runoff) and surface temperature anomalies. Oceanic forcing includes the temperature, salinity, and thermal forcing of the ocean surrounding the AIS. Areas of the AIS that currently are not in contact with ocean (but may be in the future) contain oceanic forcing that has been extrapolated using methods described in Jourdain et al. (2020).

Each modeling group ran their ISM for an ensemble of simulations, denoted "experiments," each time with the provided forcing values. For more information on the experiments and initial conditions, see Nowicki et al. (2020) and Table 1 of Seroussi et al. (2020). The results of the ISM simulations for each experiment were reported for each of the 18 sectors of the AIS (Figure 1). Each modeling group also provided a "control" experiment in which the ISM was run with constant climate conditions (no evolving forcings) representative of the past few decades. The control experiment was subtracted from each experiment's outputs to remove the impact of model drift and the effect of differing initialization methods (Nowicki et al., 2020). Therefore, as was done in ISMIP6 (Seroussi et al., 2020), the results of this study are to be interpreted as the simulated response to additional climate change compared to current conditions.

Figure 1. Map of the 18 regions of the Antarctic Ice Sheet (AIS), as established by the Ice Sheet Model Intercomparison for CMIP6 protocol (Seroussi et al., 2020). Each of the 18 AIS regions is characterized by distinct atmospheric and oceanic forcing data.

2.2 Training and Test Data

The original forcings, which include atmospheric, prescribed ice shelf collapse mask, and oceanic forcings, are measured yearly over 2016–2100. We use atmospheric forcings that were regridded to 8 km to match the grid of the oceanic forcings, which we averaged over a third depth dimension z (−30 to −1,770 m). Specific ice shelf collapse forcings are omitted in this study due to the large proportion of experiments that exclude them (see Seroussi et al., 2020, Table 1) and to decrease emulator complexity. Previous studies partitioned the AIS into three distinct regions: West Antarctica, East Antarctica, and the Antarctic Peninsula (Edwards et al., 2021). We keep the data resolution closer to the native resolution and aggregate the forcing data over the grid cells within the 18 ISMIP6 sectors representing the basic drainage areas and dynamical flow patterns of the ice sheet, taking the unweighted mean forcing values of all the grid cells within each sector (Figure 1). Maintaining smaller subdivisions allows spatial information about individual regions to be passed through the emulator and avoids information loss through aggregation over a broader area, but handling this higher resolution requires efficient training and emulation.

The ISM simulation outputs, which were already aggregated by sector, were then paired with the mean atmospheric and oceanic forcing values by sector for each individual ISM. Simulation outputs include ice area, ice volume above flotation, spatially integrated surface mass balance, and others (see Seroussi et al., 2020). Sea Level Equivalent (SLE) is calculated from the ice volume above flotation (m3), converted to mass and then to SLE under the assumption that 362.5 Gt of ice corresponds to 1 mm of SLE (Fox-Kemper et al., 2021). Initial conditions and model settings for each experiment (from Table 1, Seroussi et al. (2020)) were encoded as inputs to the emulator. Projections that terminate below the 0.005 quantile or above the 0.995 quantile of SLE were deemed outliers and dropped in order to prioritize creating an emulator that approximates the entire ensemble of ISM predictions rather than individual ISM accuracy. Categorical features, such as the ISM, AOGCM, and RCP scenario, were then encoded by creating indicator variables with boolean flags. Due to the large range of scales, the strictly positive climate forcings were scaled between 0 and 1, and forcings that include both positive and negative values were standardized by subtracting the mean and scaling to unit variance.
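
The conversion from ice volume above flotation to SLE, including the control-run correction described in Section 2.1, can be sketched as follows (a minimal illustration with hypothetical variable and function names; the ice density value is a standard assumption rather than a number given in the text):

```python
import numpy as np

RHO_ICE = 917.0        # kg m^-3; typical density of glacial ice (assumed, not from the text)
GT_PER_MM_SLE = 362.5  # Gt of ice per 1 mm of sea level equivalent (Fox-Kemper et al., 2021)

def volume_to_sle_mm(ivaf_m3):
    """Convert ice volume above flotation (m^3) to sea level equivalent (mm)."""
    mass_gt = ivaf_m3 * RHO_ICE / 1e12  # kg -> Gt (1 Gt = 1e12 kg)
    return mass_gt / GT_PER_MM_SLE

def sle_contribution(ivaf_experiment, ivaf_control):
    """Sea level contribution relative to the control run.

    Both arguments are arrays of ice volume above flotation (m^3) over the
    projection years; subtracting the control removes model drift.
    """
    loss_experiment = ivaf_experiment[0] - np.asarray(ivaf_experiment)
    loss_control = ivaf_control[0] - np.asarray(ivaf_control)
    return volume_to_sle_mm(loss_experiment - loss_control)
```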

We compute lag features of time-dependent forcings to enable the emulator to learn from past observations. Each observation contains the current state of climate forcings as well as those from the previous 5 years, meaning that a given observation will have five additional sets of time-dependent forcings that correspond to the previous 5 years. We do not include the target of prediction, SLE, as a lag feature because the purpose of the study is to create an emulator with the same inputs as an ISM simulation scenario. Lag features are not always necessary when using recurrent architectures because of the ability of the network to handle sequences, but studies have shown that they can improve accuracy by weighting each feature individually (Surakhi et al., 2021). We found that including lag features improved model performance at little computational cost. We selected a look-back window of 5 years through systematic testing, weighing the trade-off between incorporating more signal into the data and unnecessarily inflating the data set (Text S1 in Supporting Information S1).
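
A lag-feature construction along these lines could look like the sketch below, assuming a long-format table with one row per ISM, experiment, sector, and year (the column names and grouping keys are illustrative, not the paper's actual schema):

```python
import pandas as pd

def add_lag_features(df, time_varying_cols, n_lags=5,
                     group_cols=("ism", "experiment", "sector")):
    """Append lagged copies of the time-dependent forcings within each series.

    In the paper, forcing years before 2016 presumably supply the history for
    the earliest projection years; here, rows with incomplete history are
    simply dropped for brevity.
    """
    df = df.sort_values(list(group_cols) + ["year"]).copy()
    grouped = df.groupby(list(group_cols), sort=False)
    for col in time_varying_cols:
        for lag in range(1, n_lags + 1):
            df[f"{col}_lag{lag}"] = grouped[col].shift(lag)
    return df.dropna()
```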

Each observation in the final data set contains the following features: forcings from the target projection year, the forcing values from the previous five years for all time-dependent forcings, and the static initial conditions, representing a total of 99 input features (see Table S1 in Supporting Information S1). The prediction target of the emulator is SLE (in mm) at the target projection year. The processed data were split for training and testing by randomly assigning entire 85-year time series to either the training or testing set; 30% of the combinations of ISM, experiment, and sector were randomly assigned to the testing set and the other 70% to the training set. The training set was then split further using a similar procedure to create the validation set for hyperparameter tuning. The training set consists of 1,734 series, each spanning 85 years (a total of 147,390 observations), while the testing set has 734 series of the same length.
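
The split by whole series rather than by individual observations can be reproduced with a grouped splitter, as in the sketch below (column names and the random seed are illustrative):

```python
from sklearn.model_selection import GroupShuffleSplit

def split_by_series(df, test_size=0.3, seed=0):
    """Assign entire 85-year series to train or test based on the
    (ISM, experiment, sector) combination, so no series is split across sets."""
    groups = (df["ism"].astype(str) + "_"
              + df["experiment"].astype(str) + "_"
              + df["sector"].astype(str))
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=groups))
    return df.iloc[train_idx], df.iloc[test_idx]
```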

3 Methods

3.1 Time-Dependent Neural Networks

Deep learning (LeCun et al., 2015), a class of machine learning methods using multi-layer NN architectures, provides tools for fast and accurate approximation of complex Earth systems (Bergen et al., 2019; Lary et al., 2016). Because of the ability to handle large data sets, NNs are commonly used for processing images, text and other high-dimensional data types. Earth processes are likewise inherently high-dimensional, including multiple complex systems varying over both time and space (Karpatne et al., 2018). Depending on the parameterizations of the underlying climate models, climate emulation data sets can have hundreds of variables and hundreds of thousands of observations, which makes NN architecture a natural choice for modeling climate data.

Recurrent neural networks (RNNs) are a class of networks designed for processing sequential data. RNNs can effectively model time-dependent data using an architecture that retains artificial memory of previously seen observations when making new predictions. Long Short-Term Memory (LSTM) networks are a type of RNN that uses states to control the flow of information through the network (Hochreiter & Schmidhuber, 1997). Figure 2 shows a diagram of an LSTM operation, or cell. Horizontal lines show the persistence of information from previous observations in a sequence. This information is stored in cell state c and hidden state h and is passed from each cell throughout the sequence, thus creating an artificial memory of previous time-steps in the series. The vertical operations that occur within the LSTM, known as gates, determine which information persists and is passed on to the next prediction in the sequence. The first operation, the forget gate, computes a number between 0 and 1, denoting how much of the information from the previous state should be kept. The input gate then takes the result of the forget gate and the input, and decides which new information to add to the current state. Finally, the output gate takes the previous state and input, and determines the prediction, or output, of the LSTM cell. These gates enable the memory of long-term and short-term dependencies throughout the sequence. Thus, an LSTM-based architecture is better suited to emulate complex time-influenced processes by retaining artificial memory over prediction intervals.
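
For reference, the standard LSTM gate equations (Hochreiter & Schmidhuber, 1997) corresponding to this description are, for input $x_t$, hidden state $h_{t-1}$, and cell state $c_{t-1}$ at time step t, with weight matrices W and U, biases b, logistic sigmoid $\sigma$, and element-wise product $\odot$ (this is the textbook formulation, not a detail specific to the emulator):

$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$ (forget gate)

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$ (input gate)

$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$ (candidate cell state)

$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ (cell state update)

$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$ (output gate)

$h_t = o_t \odot \tanh(c_t)$ (hidden state and output)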

Figure 2. Neural network (NN) emulator architecture and Long Short-Term Memory (LSTM) cell. The blocks at the bottom show the layers in the NN architecture, with the layer shapes denoted as numbers within each block. The LSTM cell (upper left) shows the operations being performed within the LSTM layer of the NN emulator, which controls the flow of temporal information through the network. The input information x flows through gates (outlined in dashed lines), where each gate performs operations that determine which information is forgotten, which new information is stored, and which information is passed on to the next state. This information is stored within a cell state c and hidden state h for each time step t. By doing so, the LSTM is able to create an artificial memory of past observations, thus enabling more accurate predictions of future values. Monte Carlo dropout (Gal & Ghahramani, 2016a) is implemented by adding a Bernoulli dropout (p = 0.2) after each layer. For more information on dropout, see Figure 3. Input data shape is represented as follows: sequence length (5), input dimension (99). The plot (upper right) shows the approximated posterior predictive distribution generated from 100 forward passes of the variational network, with the mean prediction m(x) (red) and the uncertainty interval (orange). Definitions for terms specific to the LSTM cell are included for reference.

3.2 Neural Network Uncertainty With Monte Carlo Dropout

As scientific fields increasingly rely on simulations and machine learning to advance research, it is critical to understand the sources of uncertainty in these models and the data fed into them. For example, due to the natural variability within interconnected systems, scientists need to be able to distinguish between error and uncertainty in Earth systems models (Fox-Kemper et al., 2021). Uncertainty quantification (Abdar et al., 2021; Ghanem et al., 2016) provides a framework for estimating and characterizing the uncertainty in mathematical models, numerical simulations, machine learning algorithms, and data. There are many possible sources of uncertainty in mathematical models. Uncertainty can be broadly categorized as uncertainty due to intrinsic randomness in outcomes (such as the chaotic contributions of weather), referred to as aleatoric or irreducible uncertainty, and uncertainty due to a lack of the data or knowledge necessary to create the ideal model, referred to as epistemic uncertainty (Gawlikowski et al., 2021; Hüllermeier & Waegeman, 2021). Epistemic uncertainty, which includes uncertainty in the form or parameters of the ideal model, can be reduced with better information, such as more training data in the case of supervised learning models.

Projects like ISMIP6 use multi-member ensembles to capture uncertainties in projected climate impacts due to structural differences between models, differences in initial conditions, parameters, emissions scenarios, and other factors. Emulators learn to approximate these simulations using existing model runs as training data, but are subject to model uncertainty, for example, due to limitations in the data used for model training or model misspecification. By "model uncertainty" here we do not mean the "model bias" typical of climate models failing to reproduce the climate system; we reserve the term model uncertainty to mean the NN model failing to represent the ISMs. A favorable characteristic of a GP (our baseline emulator, Section 3.4) is that it also provides an estimate of model uncertainty through the Bayesian posterior predictive distribution. The GP will produce a distribution of functions that fit the data, and the standard deviation of that distribution represents the total variability of the predictive functions, which is equivalent to the model uncertainty (Rasmussen et al., 2006). By adding or subtracting a multiple of the standard deviation from the mean prediction, we calculate the upper and lower limits of an uncertainty interval. Within the scope of this study, we define the 1 − α uncertainty interval as an interval within which the true value will fall with probability 1 − α. We use 95% (i.e., ≃2σ) for the uncertainty level in this study (α = 0.05), meaning that the emulated uncertainty intervals from both the GP and NN can be interpreted as the range within which the true ISM value will lie with a 95% probability.

Traditionally, NNs are deterministic after training is complete, meaning only point estimates are returned on new data and an estimate of model uncertainty is not available. In order to create an output distribution, a Bayesian model, similar to the GP, can be employed. This study employs MCD, a probabilistic procedure to approximate a deep GP, and thus generate uncertainty intervals on individual predictions from the NN.

Dropout is an operation applied to network layers that temporarily removes, or “drops out,” random nodes and their connections throughout a NN (Srivastava et al., 2014). When used as a regularization technique, dropout is activated only during training to limit the model's reliance on any particular node when making a prediction. Gal and Ghahramani (2016a) show that by adding dropout after each network layer (Figure 3), each forward pass of the network essentially becomes a different prediction function (or model) due to the random configuration of active nodes. Thus by carrying out T forward passes each with a different random subset of dropped-out neurons for every test observation, the result is a distribution, or ensemble, of T predictions. It can be shown that this protocol, which is called MCD, is a Bayesian approximation of the posterior predictive distribution (Gal & Ghahramani, 2016a2016b). Then, much like a GP, the prediction is the mean of the approximated posterior predictive distribution. Let fθ be the trained NN with model parameters θ and dropout implemented after each layer. The mean prediction will be
$m(x) = \frac{1}{T}\sum_{t=1}^{T} f_{\theta_t}(x)$, (1)
where T is the total number of Monte Carlo iterations on the input x. Each individual function $f_{\theta_t}$ can be seen as a unique network, with a unique configuration of model parameters $\theta_t$ formed from the original parameter set $\theta$ by dropping out a subset of nodes to effectively zero out a subset of the model parameters (Gal & Ghahramani, 2016a). When T forward passes are performed, the prediction is the sample mean of the resulting distribution $\{f_{\theta_1}(x), f_{\theta_2}(x), \ldots, f_{\theta_T}(x)\}$. The upper and lower limits of the uncertainty interval can be calculated using a traditional 2σ measurement, where σ is the sample standard deviation of the T predictions. For more information on the dropout operation, see Text S3 in Supporting Information S1.
Figure 3. Schematic of Monte Carlo dropout. (top) Bernoulli dropout is applied to the neural network to remove a subset of nodes, creating a distinct network $f_{\theta_t}$. In this schematic, dropped-out nodes are shown in gray with black crosses through them. The collection of T forward passes creates an ensemble of network predictions $\{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T\}$, which approximates the posterior predictive distribution (bottom). From this ensemble, a mean prediction m(x), where x is the input to the emulator, and a 2σ (95%) uncertainty interval are calculated.

MCD provides an intuitive and efficient implementation of a variational network, or a network that uses variational inference to approximate the posterior distribution over model parameters (Gal & Ghahramani, 2016b), without having to alter the training process drastically. To implement MCD, the dropout layers of the trained network should be enabled during the testing phase. Then, for each test observation, multiple network predictions should be made on the same data and the results compiled to create an ensemble of outputs for that observation. However, there are inefficiencies when each prediction requires multiple network forward passes (Gal & Ghahramani, 2016a), so between 30 and 100 Monte Carlo iterations are recommended. We use 100 iterations of MCD for each test observation to get a better approximation of the posterior predictive distribution.
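
A minimal PyTorch sketch of this inference procedure is shown below, assuming a trained module `emulator` that contains nn.Dropout layers (the function name and tensor shapes are illustrative):

```python
import torch

def mc_dropout_predict(emulator, x, n_samples=100):
    """Monte Carlo dropout inference: keep dropout active at test time,
    run T stochastic forward passes, and summarize the resulting ensemble."""
    emulator.eval()                      # standard evaluation mode...
    for module in emulator.modules():    # ...but re-enable the dropout layers only
        if isinstance(module, torch.nn.Dropout):
            module.train()

    with torch.no_grad():
        samples = torch.stack([emulator(x) for _ in range(n_samples)])  # (T, batch, 1)

    mean = samples.mean(dim=0)           # m(x), Equation 1
    std = samples.std(dim=0)             # sample standard deviation of the T passes
    return mean, mean - 2 * std, mean + 2 * std   # prediction and ~95% interval
```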

3.3 Emulator Architecture

The emulator used in this study is an LSTM model with the added ability to quantify uncertainty through MCD. Multiple iterations of various configurations of model architecture and hyperparameters were carried out using a grid search method. The result of each iteration was logged, and the model with the lowest mean squared error (MSE) loss on the validation data was saved and used as the proposed emulator. Only after hyperparameter tuning was complete was the test set used to calculate the performance metrics outside of a development setting. For more details on the architectures and hyperparameters tested, see Text S1 in Supporting Information S1.

The optimal emulator architecture includes a single LSTM unit with 512 nodes (see Figure 2). The output of the LSTM layer is passed to a 32-node linear layer and finally to the output layer. Dropout layers were added after each network layer (including the LSTM unit) for implementation of MCD (Figure 2). The Bernoulli dropout probability was set to 0.2 to enable accurate prediction while promoting variability in the Monte Carlo passes during testing. The emulator was implemented in PyTorch (Paszke et al., 2019) due to the flexibility and readability of its modules. The emulator was trained using an MSE loss for 100 epochs with a batch size of 256. The code for the network architecture, training, testing, and more can be found within the ice sheet emulator code repository (see Data Availability Statement). Training was carried out on a single NVIDIA Quadro RTX GPU to demonstrate the accessibility of the methodology to a wide range of scientists. The result is a single model that predicts the entire ensemble of ISM projections, with the added capability of quantifying uncertainty. Figure 4 gives an overview of the methodology presented in this study. Panel A highlights the processing and aggregation of the data into ISMIP6 sectors, Panel B details the emulator architecture, and Panel C shows the model prediction for an individual projection, as well as how that projection fits into the prediction of the entire ISMIP6 ensemble.
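
A sketch of this architecture and training configuration is given below. The layer sizes, dropout probability, loss, number of epochs, and batch size follow the text; the activation function, optimizer, and learning rate are assumptions, and the dummy tensors only stand in for the processed forcing data set:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class VariationalLSTMEmulator(nn.Module):
    """LSTM(512) -> Linear(32) -> Linear(1), with Bernoulli dropout (p = 0.2)
    after each layer to enable Monte Carlo dropout at test time."""

    def __init__(self, n_features=99, hidden=512, p_drop=0.2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.drop1 = nn.Dropout(p_drop)
        self.fc = nn.Linear(hidden, 32)
        self.drop2 = nn.Dropout(p_drop)
        self.out = nn.Linear(32, 1)

    def forward(self, x):                 # x: (batch, sequence_length=5, n_features=99)
        _, (h_n, _) = self.lstm(x)        # hidden state at the last time step
        h = self.drop1(h_n.squeeze(0))
        h = self.drop2(torch.relu(self.fc(h)))  # ReLU is an assumption
        return self.out(h)                # SLE (mm) at the target projection year

# Dummy tensors with the documented shapes; replace with the processed data set.
x, y = torch.randn(1024, 5, 99), torch.randn(1024, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=256, shuffle=True)

model = VariationalLSTMEmulator()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # optimizer and lr assumed
criterion = nn.MSELoss()

for epoch in range(100):                  # 100 epochs, as in the text
    for x_batch, y_batch in loader:
        optimizer.zero_grad()
        loss = criterion(model(x_batch), y_batch)
        loss.backward()
        optimizer.step()
```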

Figure 4. Overview of the methodology presented in this study. Panel (a) shows the aggregation of Antarctic Ice Sheet forcing data from 8 km grids to Ice Sheet Model Intercomparison for CMIP6 (ISMIP6) sectors. Panel (b) shows the emulator architecture with Monte Carlo dropout implemented for uncertainty quantification. Panel (c) presents the output of the model for both individual Sea Level Equivalent (SLE) projection (with uncertainty) as well as all ISMIP6 ice sheet model ensemble projections. The red line in the individual projection (left), denoting the predicted SLE, is highlighted in the emulated ensemble (right) to show how each individual projection fits into the ensemble. In addition to the uncertainty on the entire ensemble (blue dashed lines), each projection has an associated individual uncertainty (orange, left).

The goal of the proposed emulator is to accurately predict not only individual time series of projections but also to approximate the distribution of SLE predictions at any given year. With an emulator that can accurately approximate the ensemble of ISM predictions at any given year (e.g., capture the distribution of sea level contribution at year 2100), climate scientists will be able to quickly estimate the range of possible sea level changes before the ISMs are required to generate additional simulations. With the current configuration of model architecture and training procedure, we aim to also provide a fast, efficient, and portable alternative to GP emulators.

3.4 Baseline: Gaussian Process Regression

GPs have been the predominant climate emulator architecture because they are flexible, non-parametric, and provide uncertainty estimates (Edwards et al., 2019). GPs are Bayesian models that produce a distribution of functions that fit the data, resulting in the ability to easily calculate uncertainty. However, training a GP requires the calculation of the inverse and determinant of a kernel matrix, which has a cubic time complexity and a quadratic space complexity, making work with large data sets nearly intractable (Liu et al., 2020). For this reason, GP-based climate emulators have been limited to small data sets, selecting only the subset of observations most influential on the target variable. This results in lower predictive power than would otherwise be attainable because of the information lost through data set reduction.

GPs are often used to model time series data because of the ability to incorporate prior information that promotes time-dependent prediction structure, such as smoothness and periodicity (Corani et al., 2021). However, to incorporate the time component, data from the entire sequence must be input into the GP model, which, in this case, is impractical due to the size of the data set and the inability of the GP to scale. Edwards et al. (2021) attempt to overcome this by creating 85 independent ISM emulators, one for each projection year. By doing so, only the observations from the specific year are used as inputs to each independent model, which significantly increases the training speed. However, any memory effects from the ISM physics that carry over from year to year are not directly emulated.

For comparison to the proposed NN emulator, a GP emulator is trained on a restricted version of the data set based on the model presented in Edwards et al. (2021). The variables used in the GP training data set include time series of surface air temperature and ocean salinity. We tested other configurations of input parameters, but these two forcings produced the most accurate model. Independent GP models were trained for each projection year (2016–2100) and predictions were stored for analysis. Training 85 independent models reduced the input data set for each model to 1,734 observations rather than 147,390. After training, the models were saved, but prediction outside of the training loop is cumbersome because there is no single protocol for accessing and predicting with the 85 trained models. Power exponential kernels were used with an optimal alpha (power) parameter of 2.0 (equivalent to the standard Radial Basis Function), as well as an added nugget kernel to model random variability induced by variables not present in the data set (Andrianakis & Challenor, 2012). Scikit-learn (Pedregosa et al., 2011) was used for training and testing models on a CPU with 8 cores and 32 GB RAM. For comparison, we trained a version of the GP model with a GPU-accelerated implementation (GPyTorch, Gardner et al., 2018) on the same hardware used for the NN training. The GP implementation on accelerated hardware decreased the training time compared to the Scikit-learn implementation, taking 102.8 min. This speedup is significant but does not decrease training time drastically enough to be competitive with the NN training. Additionally, the quadratic space complexity persists even on accelerated hardware, so the memory required to run the GP model remains a severe limitation. The results reported in this study are from the Scikit-learn model implementation, which is more user-friendly. Following the procedure of Edwards et al. (2021), we applied a 3-year rolling average to the GP model outputs for smoothing. The training time (in minutes) was determined by logging wall time before and after training was complete.
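
A simplified version of this per-year GP baseline can be sketched with Scikit-learn as follows (data containers, kernel hyperparameters, and the normalization flag are assumptions; the kernel structure of an RBF plus a white-noise nugget follows the description above):

```python
import pandas as pd
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel

def train_yearly_gps(X_by_year, y_by_year, years=range(2016, 2101)):
    """Train one independent GP per projection year.

    X_by_year[year]: (n_series, n_inputs) forcings for that year
    (here, surface air temperature and ocean salinity);
    y_by_year[year]: (n_series,) SLE values from the ISM simulations.
    """
    models = {}
    for year in years:
        # Power-exponential kernel with power 2 (an RBF) plus a nugget term.
        kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        gp.fit(X_by_year[year], y_by_year[year])
        models[year] = gp
    return models

def predict_with_smoothing(models, X_by_year, window=3):
    """Predict each year independently, then apply the 3-year rolling average."""
    years = sorted(models)
    preds = pd.DataFrame({yr: models[yr].predict(X_by_year[yr]) for yr in years})
    # Rows are test series, columns are years; smooth along the year axis.
    return preds.T.rolling(window, min_periods=1, center=True).mean().T
```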

4 Results

4.1 Emulation Performance

The emulators were evaluated on the following criteria: individual projection accuracy, ensemble mean accuracy, year-by-year distribution similarity, efficacy of uncertainty quantification, and training and inference time. Individual projection performance, or the accuracy of the emulator in predicting a specific 85-year series of sea level projections, is evaluated on the test data set using MSE over the entire projection series. In addition, other metrics are calculated, including mean absolute error (MAE), R2, and Continuous Ranked Probability Score (CRPS). MAE is included as an interpretable complement to MSE, and R2 indicates how much of the variance is explained by the model.

A summary of the metrics comparing the proposed NN emulator and the baseline GP emulator can be found in Table 1. Results show that the NN-based emulator was more effective than the GP baseline on every metric evaluated in the study. The NN, on average, was able to predict within 0.46 mm SLE, whereas the GP predicts within 0.73 mm SLE. This is likely because the NN can use a larger set of input variables, which enables the modeling of a wider range of factors as well as the interactions between forcings. Even with the much larger input size, the training of the NN was 17 times faster than the training of the GP emulator due to GPU support for NN training. Inference time of both emulators is comparable, with the GP taking 0.25 s per test projection and the NN taking 0.39 s. The NN does take longer because its inference time increases linearly with the number of Monte Carlo iterations; however, when making predictions over the entire test data set, the difference in inference time is almost negligible. The NN-based emulator's increase in accuracy can also be attributed to the ability to model the dependence of observations within a time series of projections. By incorporating this knowledge, predictions are more accurate, even in extreme cases, as can be seen in Figure 5. For more information on emulation error separated by AIS sector, see Text S4 in Supporting Information S1. To quantitatively assess the emulators' ability to accurately model the time-dependence between observations, the first derivative of the ISMIP6 projections was calculated and compared to the first derivative of the NN- and GP-emulated projections. The MSE of the NN-emulated first derivative is 0.011 versus 0.076 for the GP. Based on this metric, the NN performs 45.6% better than the GP at approximating the first derivative of the projections, thus more accurately modeling the time-dependence within sea level projections.
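
The first-derivative comparison amounts to differencing each 85-year series along the time axis before computing the MSE, as in this short sketch (array names are illustrative):

```python
import numpy as np

def first_derivative_mse(y_true, y_pred):
    """MSE of the year-to-year first differences of SLE projections.

    y_true, y_pred: arrays of shape (n_series, n_years), e.g. (734, 85).
    Differencing along the time axis approximates dSLE/dt and measures how
    well an emulator captures the time-dependence of the projections.
    """
    d_true = np.diff(y_true, axis=1)
    d_pred = np.diff(y_pred, axis=1)
    return float(np.mean((d_true - d_pred) ** 2))
```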

Table 1. Comparison of Proposed Variational Long Short-Term Memory Neural Network Emulator and Gaussian Process Emulator

Group                            Metric                                       Neural network   Gaussian process
Individual Prediction Accuracy   Mean Squared Error (mm2 SLE)                 1.011            2.070
                                 Mean Absolute Error (mm SLE)                 0.462            0.733
                                 R2                                           0.778            0.544
                                 Continuous Ranked Probability Score (CRPS)   0.385            0.607
Ensemble Distribution Accuracy*  KL Divergence                                18.259           199.137
                                 JS Divergence                                0.074            0.241
Training Time                    Walltime (min)                               11.8 (GPU)       102.8 (GPU); 118.5 (CPU)

  • Note. Metrics calculated on the 734 held-out 85-year time series in the test data set. SLE refers to the Sea Level Equivalent (in mm).
  • * Evaluated at projection year 2100.
Figure 5. Example of two representative projections (RCP 8.5) from the IMAUICE1 ice sheet model from the IMAU modeling group (De Boer et al., 2014). Panel (a) shows a projection (green) with more extreme changes of sea level contribution, while Panel (b) shows a projection closer to zero. In these cases, both the neural network emulator (orange) and the Gaussian Process (GP) emulator (blue) follow the general trajectory of the projection. The NN-emulated projection, however, is consistently closer to the true value and more accurately approximates the shape of the projection. The GP-emulated projection captures the trajectory, but often is more conservative in its prediction, resulting in more emulated projections being close to zero (see Figure 7). GP predictions include 3-year smoothing.

To better understand whether incorporating the temporal structure of the data or using a broader range of inputs improves the model accuracy, separate models were trained and compared to isolate each of these variables (see Text S5 in Supporting Information S1). Models with the LSTM architectures and lag features were trained alongside independent models (removing temporal architecture and lag features) to isolate the effect of time-specific models on model accuracy. Results show that directly modeling the time domain with an LSTM clearly increases accuracy of the emulated ensemble mean compared to the true ISMIP6 ensemble mean. In addition, models with an increased number of input forcings (i.e., higher-resolution AOGCM data) create smoother and more accurate mean predictions. These experiments indicate that better physics—in particular a combination of temporal modeling and the use of a wider range of input forcings—contribute to the improvement of accuracy generated by the proposed NN emulator, not just algorithm distinctions between the NN and GP. While a GP may be a suitable architecture in scenarios with limited input forcings, a NN that incorporates the temporal structure of the full range of forcing data will result in better accuracy, a closer approximation of the ISMIP6 ensemble distribution, and generally more realistic ice sheet dynamics. The higher training efficiency of the NN (Table 1) is important in enabling the larger forcing data sets and temporal structure. For more information on the details of the experiments and results, see Text S4 in Supporting Information S1.

Errors in the NN emulator are generally dependent on year, spatial location, and ISM. Prediction MSE gets progressively higher as the projection year increases, as it does with the GP emulator, but at a much slower rate (Figure S4 in Supporting Information S1). This may be caused by the increase in the number of years since the last observation as well as the more extreme sea level projection values in later years. The emulator performs best on projections from the following ISMs: PISM1 and 2 (AWI), IMAUICE1 (IMAU), and ISSM (JPL) and it performs worst on fETISh (ULB), ISSM (UCIJPL), and PISM (VUW). Note that the emulator performance does not reflect the performance of the ISM, simply the ability to approximate the underlying functionality of those models. For more information on those models and their respective modeling groups, see Seroussi et al. (2020). Likewise, the emulator performs best in sectors 16 (near Larsen Ice Shelf), 13 (Enderby Land), and 12 (Oates Land), and worst in sectors 4 (Amundsen Sea Sector region), 2 (Siple Coast), and 8 (Aurora Basin; see Figure 1). Interpreting the underlying causes of these trends is outside the scope of this study, although it is notable that many regions already experiencing significant dynamic changes are the most difficult to emulate. Better understanding the underlying reasons may be an avenue for future research and collaboration within the community. For more information on error analysis, including complete tables and figures, see Text S4 in Supporting Information S1.

Figure 5 shows a representative example of two distinct simulations from the Institute for Marine and Atmospheric Research Utrecht (IMAU) ISM, IMAUICE1 (De Boer et al., 2014) for a more extreme final SLE value (panel A) and a final SLE value closer to 0 (panel B). Results show, as can be seen in the examples in both panels A and B, that while both emulators can capture the general shape of the projection, the NN consistently emulates the true projection value more closely than the GP. In addition, the GP relies heavily on the 3-year smoothing to capture the true trajectory—without the 3-year smoothing, GP predictions are erratic due to the lack of dependence on previous observations. While a time-average smoothing qualitatively appears to address this concern, the underlying inability to model the time domain still persists. To improve GP accuracy, several attributes of the model were changed, including choice of kernel function, kernel hyperparameters, and input variables, with little effect on the result. The use of more training observations was attempted but was quickly constrained due to the quadratic space complexity when training—over 250 GB RAM is required to train a GP model with the same number of training observations as the NN. While the hardware requirement of over 250 GB RAM is not outside the realm of feasibility on high-performance computing clusters, we highlight this as a limitation for many scientists and propose that comparing a GP emulator trained on the full set of training observations to the NN proposed in this study would be a valuable future avenue of research.

To measure the quality of the predictive distribution that is returned for each test projection (from which the uncertainty interval is calculated), we use the Continuous Ranked Probability Score (CRPS). This score compares a predictive distribution generated by a probabilistic or Bayesian model with a true, observed value (Gneiting & Raftery, 2007; Matheson & Winkler, 1976). Continuous Ranked Probability Score is commonly used for comparing and assessing the quality of predictive distributions generated by probabilistic or Bayesian models, particularly in weather forecasting (Grimit et al., 2006; Zamo & Naveau, 2018). The CRPS can be interpreted as a generalization of the MAE, with the MAE being a specific case of the CRPS where the predictive distribution is a single value (Hersbach, 2000). Similar to the MAE, a smaller CRPS indicates a better approximation of the expected predictive distribution. Results, which were calculated using the properscoring python package (Climate Corporation, 2022), show that the NN emulator produced a CRPS of 0.385 whereas the GP produced a CRPS of 0.607, showing that the NN is not only able to preserve the capability to quantify uncertainty from the predictive distribution, but the distributions it generates are more accurate. For more information on how the CRPS is calculated, see Text S6 in Supporting Information S1.
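
In practice, the CRPS of the Monte Carlo dropout ensemble can be computed directly from the T stochastic forward passes, for example with properscoring's ensemble-based scorer (array names are illustrative; the Gaussian closed form noted in the comment is an alternative for a Gaussian posterior such as the GP's):

```python
import numpy as np
import properscoring as ps

def mean_crps(y_true, mc_samples):
    """Average CRPS of the predictive distributions against the ISM values.

    y_true: (n_obs,) ISM simulation values;
    mc_samples: (n_obs, T) Monte Carlo dropout predictions per observation.
    """
    crps = ps.crps_ensemble(y_true, mc_samples)  # one score per observation
    return float(np.mean(crps))

# For a Gaussian predictive distribution (e.g., the GP posterior), the
# closed-form score ps.crps_gaussian(y_true, mu=mean, sig=std) can be used instead.
```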

4.2 ISM Ensemble

A key purpose of the emulator is to not only predict the mean trajectory of the ISM ensemble but also to approximate the ensemble of sea level contribution predictions from the ISM simulations. For example, it may be useful to climate scientists to know both the best estimate (sum of mean prediction from all sectors) and the range of possible values as well as the related probabilities for sea level contribution at year 2100. Ensemble mean performance is evaluated by comparing all of the emulator predictions and true values (ISM simulations) over the entire data set. The IPCC Assessment Reports use the mean value for all ISM simulations or the mean of the emulator spread (Edwards et al., 2021; Levermann et al., 2020) for a given emissions scenario to calculate the central trajectory of the ice sheet contribution to sea level over time (Fox-Kemper et al., 2021). To capture this trajectory, we compare the mean value over the ensemble of ISM simulations with the mean value over the ensemble of emulator predictions. We then sum the mean values of sea level contribution from each sector to get the total AIS sea level contribution. Figure 6a compares the ensemble mean prediction of the ISM simulations and the emulated ensemble mean for the NN and GP emulators. The NN produces a relatively smooth prediction that follows closely the mean of ISM simulations. The GP mean generally captures the same trajectory, but is much more conservative (mean sea level prediction is reduced) due to the higher frequency of projection predictions being close to zero. This can be explained by the GP's reliance on the prior distribution, which in this case is a zero-mean prior following Edwards et al. (2021).

Figure 6. Comparison of the true ice sheet model ensemble (green) with the NN- (orange) and GP- (blue) emulated ensembles, based on 734 time series in the test set. Panel (a) shows the comparison of mean sea level projection for each year from 2015 to 2100 (i.e., the sum of average contributions from each sector). Panel (b) compares the distribution of Sea Level Equivalent projections at the year 2100 for each of the 734 models. The probability density is computed from the projections using Gaussian kernel density estimation.

Along with the mean ensemble prediction, it is important to approximate the distribution of the ensemble at any given year to understand the range of possible projection values. Indeed, when uncertainty is high, a central estimate is often avoided and only a range for >66% confidence (likely) or >90% confidence (very likely) are assessed in the IPCC. Figure 7 shows the predictions of each emulator for each of the ISM projections. To analyze the approximation of distribution edges, the very likely range between the 5th and 95th percentile (between the blue dashes) is shown. As can be seen, the NN approximates very closely the true percentile, or very likely range, as established by the ISMIP6 ISMs. On average, the NN-emulated percentile bounds are within 0.36 mm SLE of the true percentile bounds over the entire 85-year projection. The GP percentile range, however, is much more conservative with many predictions around zero, which results in GP-emulated percentile bounds being within 1.26 mm SLE. Beyond just the extremes, year-by-year distribution similarity is evaluated using a Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951) and a Jensen Shannon (JS) divergence (Lin, 1991) (see Text S7 in Supporting Information S1). Divergences measure the distance between two probability distributions P and Q, where P is the known distribution (ISM simulations) and Q is the modeled distribution (emulated predictions). A small divergence is desired, as a shorter distance between the true distribution P and the modeled distribution Q indicates more similarity. KL divergence is a common divergence measurement but is asymmetric and can produce unexpected behavior in modeling scenarios when Q is non-zero and P is zero. Therefore, we also include JS divergence, a symmetric, normalized variation of the KL divergence.

Figure 7. Comparison of true Ice Sheet Model Intercomparison for CMIP6 ensemble (Panel a) with the emulated ensemble generated by the neural network (NN) emulator (Panel b) and the Gaussian Process (GP) emulator (Panel c) for 734 held out series. The blue dashed lines denote the 5% and 95% quantiles in each year. Box plots of each distribution at the year 2100 (Panel d) are included for comparison (the whiskers also correspond to the 5% and 95% quantiles), with the true ensemble in green, the NN emulator predictions in orange and the GP emulator predictions in blue. The black line indicates the mean of the ensemble.

Results show that the NN outperforms the GP in distribution approximation over the entire projection interval. For every year between 2016 and 2100 the NN has a lower KL divergence and thus the NN-emulated distribution is much closer to the true ensemble distribution (Figure S8 in Supporting Information S1). The KL divergence of the GP-emulated distribution increases through time as the projection year is further from the known data, increasing drastically after 2080. The KL divergence of the NN-emulated distribution, however, continues to decrease through 2100 meaning that the approximation improves in later years. The early years show a slightly higher KL divergence for the NN, likely due to the model not having access to prior observations for those years.

Figure 6b plots the true ensemble distribution together with the NN and GP emulated distributions for the year 2100. We applied a Gaussian kernel density estimator to the outputs to compute the KL and JS divergences. The GP distribution produces a KL divergence and JS divergence of 199.137 and 0.241 respectively, while the NN produces a KL divergence and JS divergence of 18.259 and 0.074 respectively (Table 1). The predicted distributions with percentile bounds can also be visualized in Figure 7. Distribution comparisons for years other than 2100 can be seen in Text S10 in Supporting Information S1.
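
The divergence calculation at a given projection year can be sketched as follows: estimate each density with a Gaussian KDE, evaluate both on a common grid, and compute the KL and JS divergences from the discretized densities (variable names and the grid resolution are illustrative):

```python
import numpy as np
from scipy.stats import gaussian_kde, entropy

def kl_js_divergence(true_sle, emulated_sle, n_grid=512):
    """KL(P || Q) and JS divergence between the true and emulated SLE
    distributions at one projection year, each estimated by Gaussian KDE."""
    lo = min(true_sle.min(), emulated_sle.min())
    hi = max(true_sle.max(), emulated_sle.max())
    grid = np.linspace(lo, hi, n_grid)

    p = gaussian_kde(true_sle)(grid)      # P: ISM ensemble density
    q = gaussian_kde(emulated_sle)(grid)  # Q: emulated ensemble density
    p, q = p / p.sum(), q / q.sum()       # discretize to probability vectors
    q = np.clip(q, 1e-12, None)           # guard against zeros in the KL ratio

    kl = entropy(p, q)                    # KL divergence (log base sets the scale)
    m = 0.5 * (p + q)
    js = 0.5 * entropy(p, m) + 0.5 * entropy(q, m)  # symmetric JS divergence
    return kl, js
```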

5 Conclusion

This study introduces a new framework for creating ISM emulators and proposes a NN architecture as an alternative to the GP. We propose an LSTM architecture, which enables the direct modeling of the time domain and the use of a wider range of input forcings due to the dramatic computational speedup it provides. Results show that this architecture offers a substantial increase in predictive accuracy on individual projections as well as in ensemble distribution similarity compared to the baseline GP architecture. The NN-emulated results for all 18 sectors are, on average, within 0.46 mm SLE versus 0.73 mm SLE with the GP, and the NN-emulated distribution is much closer to the true distribution based on both the KL and JS divergence metrics. The proposed NN-based emulator achieves these improvements in accuracy while maintaining the GP's ability to quantify uncertainty through the implementation of MCD. Neural networks also benefit from mature software libraries that support the portability and deployment of models in large-scale applications. These libraries also support GPU training, which made the training of the NN emulator 17 times faster than the GP and makes emulators accessible to a wider range of scientists. GPs remain a good choice of emulator architecture for small data sets due to the ability to integrate prior information into the model. But in ice sheet emulation scenarios with larger data sets, we recommend NNs as an alternative to GPs due to the improvement in accuracy, the ability to quantify uncertainty, faster training times, and increased accessibility.

The level of accuracy and portability of the proposed NN-based emulator will provide climate scientists the ability to efficiently and accurately model the AIS, as well as understand the drivers of future sea level contribution. These emulators allow us to fill in gaps for missing simulations: while some ISMs are relatively inexpensive to run and can perform tens of simulations, others using a finer resolution or more complex stress balance equations (Morlighem et al., 2010) can only perform a very small number of iterations. Using the emulator to complement existing simulations and emulate missing experiments, as seen in Seroussi et al. (2020) and Fox-Kemper et al. (2021), will allow us to have a more complete and robust understanding of ice sheet projections. As emulators continue to play an increasing role in ice sheet projections, this may in turn lead to different designs in protocols for follow-on efforts to ISMIP6. Instead of asking each model to do the same suite of simulations, it may be more beneficial to sample the space more randomly and ask groups of ISMs to do specific suites of distinct simulations. It can also be envisioned that future model intercomparison projects may include intercomparison of emulators.

A key finding of this study is that there is much less of a distinction between the predictive performance of the NN and GP emulators when they are trained on a reduced number of inputs (see Text S5 in Supporting Information S1). We emphasize that a central benefit of the NN-based emulator is its ability to handle more data, which is essential for generating more accurate predictions. While there are many potential avenues of further research outside the scope of this study focusing on more state-of-the-art GP architectures (e.g., variational GPs, Titsias, 2009; sparse variational GPs, Hensman et al., 2013; GP approximation by linear stochastic differential equations, Corenflos et al., 2022), we argue that the NN maintains a superior ability to handle large amounts of high-dimensional data quickly, which the results here indicate is key to improving accuracy in ISM emulators. The purpose of the large-input NN-based emulator proposed in this study is to approximate the ISMIP6 ISMs, rather than to address the reduced-input challenge that prompted the Edwards et al. (2021) emulator, which was useful for addressing missing scenarios in Fox-Kemper et al. (2021). Because the model relies on variables that may be unavailable in other modeling scenarios (e.g., ISM name, RCP, and high-resolution atmospheric temperature and precipitation), its use is therefore constrained to sensitivity testing against the ISMs, as well as to potentially providing model simulations where ISMIP6 coverage is sparse but these data are available. In a future study, we plan to develop and implement models that focus on modeling the physical system rather than strictly the ISMIP6 ensemble, which may prove useful for projecting novel climate scenarios as in Edwards et al. (2021) and will have more direct implications for reevaluating past IPCC sea level projections (Fox-Kemper et al., 2021) and informing future ones.

Future work should include improving both the model performance and the ability to quantify uncertainty. The proposed emulator functions as a black box model: no physical knowledge about the system being emulated is incorporated. A promising direction for creating more accurate projections in ice sheet emulation is the use of physics-informed learning to incorporate domain knowledge into the emulator (Kashinath et al., 2021; Raissi et al., 2019). Future work could also include the reduction and analysis of the sources of uncertainty mentioned in Section 3.2. This study shows that the NN-emulated predictive distributions used to calculate uncertainty estimates are more accurate, but it does not thoroughly investigate the factors driving this improvement. Additionally, other tools for quantifying uncertainty, such as deep ensembles (Fort et al., 2019) and conformal prediction (Fontana et al., 2023), could provide new insight into sources of uncertainty, reduce the computational cost of producing uncertainty estimates, and potentially enable more accurate uncertainty estimation. Future work could also incorporate the ice collapse forcing and investigate its effects on sea level for possible inclusion in future emulators. Another key limitation of the proposed emulator is that it ignores the data uncertainty (i.e., the uncertainty in the forcings generated by the AOGCMs). To produce uncertainty estimates that are more representative of the true projection of sea level contribution, future research could focus on propagating the uncertainty in the input forcings through the emulator architecture.
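
As one example of the alternatives mentioned above, a deep ensemble (Fort et al., 2019) replaces the dropout samples with predictions from several independently initialized and trained copies of the emulator. The sketch below, with illustrative names and under the assumption that each member shares the emulator interface, shows how the across-member spread could serve as an uncertainty estimate.

import torch

def deep_ensemble_predict(models, x):
    # Aggregate predictions from independently trained emulators; the spread
    # across ensemble members provides an alternative uncertainty estimate.
    preds = []
    with torch.no_grad():
        for m in models:
            m.eval()                  # deterministic forward pass per member
            preds.append(m(x))
        preds = torch.stack(preds)
    return preds.mean(dim=0), preds.std(dim=0)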

Beyond model-specific improvements, the proposed emulator architecture and uncertainty quantification methods could be used to create emulators for other land ice areas, such as the Greenland Ice Sheet and mountain glaciers (Marzeion et al., 2020), using a similar suite of input parameters. This study can serve as a starting point for advancing the capabilities of climate emulators related to sea level projections.

Acknowledgments

PVK is supported by the National Science Foundation Graduate Research Fellowship Program under Grant 2040433. BFK was supported by the Schmidt Futures Foundation Scale-Aware Sea Ice Project (SASIP). HS was supported by grants from NASA Cryospheric Science Program (80NSSC21K1939 and 80NSSC22K0383). SN was supported by grants from NASA (awards 80NSSC21K0915 and 80NSSC21K0322). Computational resources and services required for this study were provided by the Center for Computation and Visualization, Brown University. We also thank the Climate and Cryosphere (CliC) and their efforts to host ISMIP6, along with the World Climate Research Programme, which coordinated and promoted CMIP5 and CMIP6. We thank the climate modeling groups for producing and making available their model output, the Earth System Grid Federation (ESGF) for archiving the CMIP data and providing access, the University at Buffalo for ISMIP6 data distribution and upload, and the multiple funding agencies who support CMIP5 and CMIP6 and ESGF. We thank all those involved with ISMIP6 for making this research possible. The authors thank Robert Kopp and two anonymous reviewers for their invaluable feedback on the manuscript. This is ISMIP6 contribution number 32.

Data Availability Statement

The datasets required to recreate the results from this study can be found on Ghub, a platform that hosts many ice-sheet-related datasets and contains all additional information required for access. The AIS datasets can be found at https://theghub.org/dataset-listing (GHub, 2023a), and additional information about access, data variables, experimental protocol, and more is available on the ISMIP6 wiki page at https://theghub.org/groups/ismip6/wiki/ISMIP6-Projections-Antarctica (GHub, 2023b). These datasets, which take up 640 GB of disk space, are open source and available to download.

For transparency and reproducibility of all methodology, the code was compiled into ISE: Ice-sheet emulators, a repository for processing, modeling, and testing ice sheet emulators. The repository includes all processing of the raw ISMIP6 data obtained from the original repositories, feature engineering, model training and evaluation, and the reproduction of all figures in this study, along with readable API documentation. The code for this manuscript can be found at https://doi.org/10.5281/zenodo.10416634 (Van Katwyk, 2023), and the most up-to-date code version can be found at https://github.com/Brown-SciML/ise.