Linear Inverse Modeling for Coupled Atmosphere‐Ocean Ensemble Climate Prediction

Paleoclimate data assimilation (PDA) experiments reconstruct climate fields by objectively blending information from climate models and proxy observations. Due to high computational cost and relatively low forecast skill, most reconstruction experiments omit the prediction step, where a forecast is made from the previously reconstructed state to the next time at which proxy data are available. To enable this critical aspect of PDA, we propose an efficient method of generating forecast ensembles of coupled climate fields using a linear inverse model (LIM). We describe the general calibration of a LIM on multiple fields using a two-step empirical orthogonal function field compression to efficiently represent the state. We find that a LIM calibrated on global climate model (GCM) data yields skillful forecasts, including for out-of-sample tests on data from a different GCM. The deterministic forecast skill tests for scalar indices show that the LIM outperforms damped persistence at leads up to 3 years and has skill up to 10 years for global average sea surface temperature. Analysis of 1-year forecasts reveals that the LIM captures dynamic climate features with local and remote predictability related to teleconnections. The forecast ensemble characteristics of the LIM, which in part determine the weighting of information for PDA experiments, show that the LIM generally produces ensemble forecast errors that are 10% to 70% larger than ensemble variance for 1-year forecasts on data representative of the last millennium. These results show that the LIM produces ensembles with reasonable calibration but also that LIMs for PDA may require some variance tuning to work optimally for data assimilation experiments.


Introduction
The field of paleoclimate data assimilation (PDA) is in a phase of rapid development and expansion due to the demonstrated capability of new ensemble-based assimilation techniques (e.g., Dubinkina et al., 2011; Franke et al., 2017; Hakim et al., 2016) and the availability of global proxy databases (PAGES 2k Consortium, 2017; Anderson et al., 2019). These PDA projects provide new insights into climate dynamics and variability of the past 2,000 years with estimates of multivariate spatial fields and robust uncertainty quantification. In creating a historical analysis, PDA objectively blends climate information recorded by proxy observations with spatial field relationships obeying the dynamics of climate models. However, a key aspect of traditional data assimilation missing from most PDA studies is the climate model forecast between analysis times (Franke et al., 2017; Hakim et al., 2016; Steiger et al., 2018). The choice derives from the prohibitive computational cost of running ensembles of coupled Earth system models, for which a single 1,000-year coupled climate simulation requires several months to complete on a supercomputer. Additionally, the low skill of climate forecasts on proxy timescales yields relatively little information for the cost of production. When using the no-forecast technique, known as "offline" assimilation (Oke, 2002), the temporal character of the reconstructed fields is entirely dependent on the proxy observations. However, studies investigating climate model behavior suggest that oceans are a source of memory and predictable dynamics up to decadal timescales (e.g., Branstator et al., 2012; Hawkins & Sutton, 2009; Zanna, 2012). Accessing this predictive skill can increase the fidelity of climate reconstructions through the propagation in time of information from assimilated proxy observations.
Here we propose and evaluate a computationally inexpensive ensemble forecast system to capture this predictive skill for use in future PDA experiments.
Using Earth system models for long-timescale ensemble forecasts in data assimilation experiments presents many challenges, foremost among them expense. At a minimum, useful model characteristics for such experiments include forecast skill on the timescale of the observations and ensemble statistics that reflect the uncertainty of the modeled system. High-resolution coupled reanalyses from atmosphere-ocean forecast systems have only recently become feasible due to advances in coupled state initialization and assimilation procedures (Laloyaux et al., 2018). However, the lack of instrumental observations further back in time limits this and other similar reanalysis products to the 20th century (e.g., Compo et al., 2011; Poli et al., 2016). Other studies utilize simpler methods of climate forecasting for reconstructions on longer timescales, such as Earth system models of intermediate complexity (Crespin et al., 2009; Goosse et al., 2010) and a coarsened-resolution global climate model (GCM) (Matsikaris et al., 2015). Perkins and Hakim (2017) demonstrated that an even simpler empirical model (a linear inverse model [LIM]; Penland & Sardeshmukh, 1995) fit on GCM simulation data can be used to efficiently forecast surface temperatures in PDA experiments. Here we expand considerably on Perkins and Hakim (2017) by testing a general method of LIM formulation aimed at simulating multiple fields for coupled climate forecasts on annual-to-decadal timescales.
LIMs have been employed in various capacities, including exploring ocean-atmosphere dynamics for the El Niño-Southern Oscillation (ENSO) (Newman et al., 2009, 2011; Penland & Sardeshmukh, 1995; Vimont et al., 2014) and the Pacific Decadal Oscillation (PDO) (Alexander et al., 2008), usage as a null hypothesis for the statistics of a system (Ault et al., 2013, 2018; Newman, 2013), and assessing predictability for ocean-atmosphere systems (Hawkins & Sutton, 2009; Huddart et al., 2017; Zanna, 2012). Newman (2013) uses a LIM fit to observational data and finds comparable, and sometimes superior, skill relative to GCMs when forecasting Instrumental Era surface temperatures up to decadal timescales. In contrast to instrumental observations, GCM output provides a full-space, physically consistent representation of the Earth system. LIMs, which capture linear predictability and uncertainty of a dynamical system, offer an attractive method for harnessing this predictable multivariate GCM information as a low-cost climate forecast alternative.
The remainder of the paper is organized as follows. Section 2 defines the basic equations and construction of a LIM. Section 3 describes model data and the calibration method for a LIM derived from surface ocean and atmospheric fields. Section 4 investigates the skill of the LIM as a GCM approximation by assessing the forecast skill, spectral characteristics, and properties of stochastically forced ensemble forecasts. Finally, section 5 considers two approaches to extending the LIM to include upper ocean heat content (OHC).

Linear Inverse Modeling
A LIM is a linearized representation of a dynamical system. It decomposes the evolution of an anomaly state vector, x, into a deterministic operator, L, and a stochastic white-noise process, ξ (Penland & Sardeshmukh, 1995):

dx/dt = Lx + ξ. (1)
A dynamical system of this form assumes a distinct timescale separation between the slowly varying, linearly predictable processes (e.g., climate) and the fast-timescale variations represented as white noise (e.g., weather). Integration of equation (1) gives a direct propagation form, which translates the state from a given time, t, to time t + τ in the future using the propagation matrix, G_τ, and a random error vector, ε(t):

x(t + τ) = G_τ x(t) + ε(t). (2)

We determine the propagation matrix from the τ-lag covariance statistics of the system,

G_τ = C(τ)C(0)^(-1). (3)

10.1029/2019MS001778
We calculate L using the relationship L = ln(G_τ)/τ. In equation (3), C(n) represents the sample lag covariance of the state for a time lag, n, which is given by C(n) = ⟨x(t + n)x^T(t)⟩ with angle brackets denoting an expectation (approximated by averaging in time). Using G_τ, the deterministic forecast (omitting noise) for different lead times, N, is given by

x̂(t + Nτ) = (G_τ)^N x(t). (4)

By constructing a LIM based on lag covariance statistics, we assume stationary statistics of the system, which implies that the system dynamics are stable and representative for the time period of interest. This assumption requires that an eigendecomposition of matrix L should have eigenvalues with only negative real parts (signifying an exponential decay in time). We note that the individual eigenmodes are not orthogonal, which allows forecast modes to constructively interact to produce transient anomaly growth over some time interval (e.g., ENSO). Individual modes may also come in complex conjugate pairs, signifying an evolving spatial pattern over time. With these properties, a LIM may be considered a "multivariate red noise" model, or more generally, the continuous analog of a multivariate AR1 process (Newman, 2007; 2013).
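To make the calibration concrete, the following minimal sketch estimates G_τ from the lag covariances (equation (3)), derives L, and applies the deterministic forecast of equation (4). Function and variable names are ours, not from the paper, and the state matrix is assumed to already be in a reduced (EOF) basis with the time mean removed:

```python
import numpy as np
from scipy.linalg import logm

def calibrate_lim(x, tau=1):
    """Estimate G_tau and L from a (time, state) anomaly matrix x.

    Implements G_tau = C(tau) C(0)^-1 (equation (3)) and L = ln(G_tau)/tau.
    """
    n = x.shape[0]
    c0 = x[:n - tau].T @ x[:n - tau] / (n - tau)   # C(0), sample covariance
    ctau = x[tau:].T @ x[:n - tau] / (n - tau)     # C(tau) = <x(t+tau) x(t)^T>
    g_tau = ctau @ np.linalg.inv(c0)               # equation (3)
    L = logm(g_tau) / tau                          # matrix logarithm
    return g_tau, L

def forecast(g_tau, x0, n_leads):
    """Deterministic forecast x_hat(t + N*tau) = G_tau^N x(t) (equation (4))."""
    return np.linalg.matrix_power(g_tau, n_leads) @ x0
```

For a stable calibration, the eigenvalues of the estimated L should all have negative real parts, which can be checked directly after fitting.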
The noise forcing statistics, Q = ⟨ξξ^T⟩dt, for a stable system balance the deterministic energy loss as described by the fluctuation-dissipation relationship (Penland & Matrosova, 1994),

LC(0) + C(0)L^T + Q = 0. (5)

Using Q and L, one can perform stochastic integration of equation (1) to produce a sample trajectory given the deterministic dynamics and a noise-induced uncertainty. This can be useful, for example, to evaluate the power spectral density of the LIM over a wide range of frequencies (see section 4.2) and, in PDA, to estimate proxy values at variable time resolutions. Penland and Matrosova (1994) describe a two-step scheme for performing stochastic integration with a LIM:

y(t + Δt) = y(t) + Ly(t)Δt + Φ_Q Λ^(1/2) r √Δt, (6)

x(t + Δt/2) = [y(t) + y(t + Δt)]/2, (7)

where y is an intermediate variable, Φ_Q is a matrix whose columns are eigenvectors of Q, Λ is a diagonal eigenvalue matrix of Q, and r is a vector of random numbers drawn from a Gaussian distribution, r ∼ N(0, 1). This integration method is specifically designed to preserve the system covariance and noise statistics on which the LIM was calibrated. The resulting model is a predictably linear and stable system that captures broadband spectral peaks related to noise and dynamic interactions in the state. For more detailed accounts of the construction, use cases, and assumptions of LIMs, please see, for example, Penland and Matrosova (1994), Penland and Sardeshmukh (1995), Newman (2007), and Newman et al. (2011).
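A sketch of the two-step stochastic integration follows, assuming Q has been obtained from the fluctuation-dissipation balance (equation (5)) and made positive semidefinite; all names are illustrative rather than taken from the paper:

```python
import numpy as np

def integrate_lim(L, Q, x0, n_steps, dt, seed=0):
    """Stochastic LIM trajectory via the two-step scheme (equations (6)-(7)).

    Given equation (5), Q can be formed as -(L @ C0 + C0 @ L.T).
    Returns states sampled at the half-steps, shape (n_steps, state_dim).
    """
    rng = np.random.default_rng(seed)
    vals, vecs = np.linalg.eigh(Q)
    vals = np.clip(vals, 0.0, None)      # guard against tiny negative eigenvalues
    amp = vecs * np.sqrt(vals)           # Phi_Q Lambda^(1/2)
    y = np.asarray(x0, dtype=float).copy()
    out = np.empty((n_steps, y.size))
    for i in range(n_steps):
        r = rng.standard_normal(y.size)
        y_next = y + L @ y * dt + amp @ r * np.sqrt(dt)  # equation (6)
        out[i] = 0.5 * (y + y_next)                      # equation (7)
        y = y_next
    return out
```

For a simple diagonal system with decay rate 0.5 and unit noise, the stationary variance implied by equation (5) is 1, which the long-run sample variance of the trajectory should approximate.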

Experiment Data and Methods
Our experiments consist of a LIM calibration on GCM data and then validation of LIM forecasts on GCM data. We consider validation using three input data sources: the GCM data the LIM is calibrated on, a separate GCM simulation using the same model but with no temporal forcing, and a separate GCM simulation using a different model with the same forcing as the calibration GCM. The span of validation sources allows us to assess the generality of the empirical model on independent data not used in calibration.

Data
We use monthly average field output from numerical simulations of two coupled GCMs, the Community Climate System Model v4.0 (CCSM4) (Landrum et al., 2013) and the Max Planck Institute Earth System Model Paleo Mode (MPI) (Bothe et al., 2013). This includes data from two experiments: the Last Millennium Simulation (past1000), a run covering 850-1850 CE that includes volcanic aerosol, greenhouse gas, and land use change forcings, and the pre-industrial control run (piControl), a no-forcing simulation using fixed pre-industrial conditions for more than 1,050 years. Model fields used include variables of interest for a coupled climate reanalysis, such as 2-m surface air temperature (TAS), precipitation (PR), 500-hPa geopotential height (ZG500), sea-level pressure (SLP), sea surface temperature (SST), and dynamic ocean surface height (ZOS). The ZOS field represents dynamical influences on the surface ocean height and can be considered an analog for the integrated effects of wind stress on ocean circulation. We also consider the inclusion of 0-700 m OHC calculated from model ocean temperatures in section 5. All fields are regridded to a regular 2° × 2° latitude-longitude grid using bilinear interpolation for consistency in the data preprocessing steps.
From the fields of SST and SLP, we calculate scalar indices to assess skill in the aggregate and for specific dynamical features (i.e., the indices are diagnosed from the forecast fields rather than forecast separately). Indices used here include global average SST, the PDO index, the Niño 3.4 index, and the North Pacific Index (NPI). All index calculations involving spatial averages and empirical orthogonal functions (EOFs) use latitude weighting (cos(φ)) to account for grid cell area variation. The Niño 3.4 index is defined as the average SST over the bounding region 5°S-5°N and 170°-120°W. The NPI is the average SLP over the region 30°N-65°N and 160°E-140°W. For the PDO index, we use the definition first described in Mantua et al. (1997): we first determine the leading EOF of SST variability in the North Pacific (20°-70°N) from linearly detrended CCSM4 past1000 output and then project SSTs onto this pattern to obtain the PDO index value.
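The cos(φ)-weighted box averaging behind indices such as Niño 3.4 and the NPI can be sketched as follows; this is a minimal illustration assuming a regular latitude-longitude grid with longitudes in 0-360° (so the Niño 3.4 box 170°-120°W becomes 190°-240°E), with names of our choosing:

```python
import numpy as np

def area_weighted_mean(field, lats, lons, lat_bounds, lon_bounds):
    """cos(latitude)-weighted average of a (lat, lon) field over a box.

    lat_bounds/lon_bounds are inclusive (min, max) tuples in degrees.
    """
    lat_mask = (lats >= lat_bounds[0]) & (lats <= lat_bounds[1])
    lon_mask = (lons >= lon_bounds[0]) & (lons <= lon_bounds[1])
    box = field[np.ix_(lat_mask, lon_mask)]
    # Weight rows by cos(latitude); each longitude column gets equal weight.
    w = np.cos(np.deg2rad(lats[lat_mask]))[:, None] * np.ones(lon_mask.sum())
    return float((box * w).sum() / w.sum())
```

A constant field should average to that constant regardless of the weighting, which provides a quick sanity check.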

LIM Calibration
For all forecasting experiments, we use a LIM calibrated on annually averaged CCSM4 past1000 simulation data. We chose this timescale due to most available proxy observations having at least annual resolution. To prepare the calibration data, we first convert monthly data to anomalies relative to climatology and then take the annual average from April to March. This shifted annual distinction helps prevent splitting important climate features like ENSO and the PDO, which reach maximum amplitudes during the winter season. We linearly detrend the data at all gridpoints to remove long-term trends.
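The shifted (April-March) annual averaging can be sketched as below; this is a minimal illustration assuming the monthly anomaly series begins in January, with incomplete years at either end dropped (function and argument names are ours):

```python
import numpy as np

def april_march_annual_mean(monthly, start_month=4):
    """Average monthly anomalies into shifted annual (April-March) means.

    monthly: array of shape (n_months, ...) starting in January.
    Returns one mean per complete shifted year.
    """
    offset = start_month - 1                    # skip Jan-Mar of the first year
    usable = (monthly.shape[0] - offset) // 12 * 12
    chunk = monthly[offset:offset + usable]
    return chunk.reshape(-1, 12, *monthly.shape[1:]).mean(axis=1)
```

For example, 36 months of data yields two complete April-March years (months 4-15 and 16-27 of the record).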
Before calculating the lag covariance statistics of the system, we reduce the dimensionality of the state using EOFs. Hawkins and Sutton (2007) present a similar EOF reduction method for two GCM three-dimensional ocean fields, which they later use to investigate GCM ocean predictability with a LIM (Hawkins & Sutton, 2009). Such a reduction compactly represents the major shared variability for a group of fields and makes for efficient LIM calibrations and forecasts. However, when including many fields in the state, the computational cost of determining these EOFs grows rapidly. To reduce the costs of field compression, we employ a two-step EOF reduction method (see Appendix A for details). First, we compress each field individually by calculating its leading 400 EOFs and projecting the field into that space. This first reduction generally retains more than 90% of each field's total variance. We then join the pre-compressed EOF space field components together to reform the state vector and perform a second EOF reduction on the covariance matrix of the combined state. We refer to these final EOFs as our multivariate EOFs. Figure 1 displays the variance retained for each field as a function of the number of retained multivariate EOF modes. For LIM assessment in this study, we opt to truncate the multivariate state to the leading 25 modes. After 25 modes, the amount of variance added by each additional EOF component levels off at around a few percent per mode. The fields of SLP and ZG500 retain the most field variance (approximately 85%), TAS and SST retain just above 70% of the variance, while ZOS and PR retain the least variance at 60% and 54%, respectively.
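The two-step compression described above can be sketched as follows; a minimal SVD-based illustration with our own names, assuming each field is supplied as a (time, space) anomaly matrix with latitude weighting already applied:

```python
import numpy as np

def leading_eofs(field, n_modes):
    """Leading EOFs of a (time, space) anomaly matrix via SVD.

    Returns (modes, projection): spatial patterns and the field
    projected into the truncated EOF space.
    """
    _, _, vt = np.linalg.svd(field, full_matrices=False)
    modes = vt[:n_modes]                  # (n_modes, space)
    return modes, field @ modes.T         # (time, n_modes)

def two_step_state(fields, n_first=400, n_final=25):
    """Step 1: compress each field separately; step 2: joint EOFs of the
    concatenated pre-compressed components (the multivariate EOFs)."""
    compressed = [leading_eofs(f, n_first)[1] for f in fields]
    joint = np.concatenate(compressed, axis=1)   # reformed state vector
    _, state = leading_eofs(joint, n_final)      # multivariate EOF truncation
    return state
```

The paper's choices (400 EOFs per field, 25 multivariate modes) appear here only as default arguments; the sketch omits the variance-retention bookkeeping shown in Figure 1.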
After reducing the state, the LIM is calibrated using equation (3) with τ = 1 (1-year lagged covariance statistics). From G_1, we solve for the dynamical operator (L) and then the noise forcing statistics of the system (Q) using equation (5). We then investigate predictive skill using equation (4), and spectral behavior and ensemble calibration using equations (6) and (7). Sensitivity of 1-year forecast skill and LIM ensemble characteristics to the multivariate EOF truncation are shown in the supporting information, Figures S1-S3. We find that forecast skill is generally insensitive to this truncation after retaining 15 to 20 multivariate EOF modes, but ensemble calibration statistics are sensitive to resulting changes in Q.
The noise forcing matrix, Q, should be positive-definite according to the assumptions described in Penland and Sardeshmukh (1995). Failure of this test does not disqualify a LIM as useful, but it does signify numerical instability or that aspects of the underlying assumptions are unmet. Previous studies (e.g., Hawkins & Sutton, 2009; Penland & Matrosova, 1994; Penland & Sardeshmukh, 1995; Zanna, 2012) often find that Q possesses some negative eigenvalues, which are potentially related to nonlinearities, noise, and fast-decaying forecast modes of L. In this case, standard practice is to remove the corresponding negative eigenmodes (e.g., as in Penland & Matrosova, 1994) and rescale Q to satisfy equation (5). However, sensitivity tests here reveal that forecast ensemble spread fluctuates unpredictably when changing the number of multivariate EOFs retained for the LIM calibration. This behavior likely results from a poorly determined noise forcing matrix and resulting numerical effects in the eigendecomposition of Q (see Figures S3 and S4). We perform additional validity tests (e.g., Gaussianity and dynamical operator stability) and find performance (Table S2 and Figure S5) similar to that described in other studies (e.g., Hawkins & Sutton, 2009; Penland & Sardeshmukh, 1995).
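The standard repair of a Q with negative eigenvalues can be sketched as below; a minimal illustration assuming the rescaling preserves the total variance (trace) of Q, which is one common way to re-satisfy equation (5) after dropping modes (names are ours):

```python
import numpy as np

def repair_q(Q):
    """Remove negative eigenmodes of Q and rescale the remainder.

    Drops modes with negative eigenvalues, then scales the positive modes
    so that trace(Q) is unchanged. Returns (repaired Q, scale factor);
    scale < 1 corresponds to the small negative rescalings quoted in the text.
    """
    vals, vecs = np.linalg.eigh(Q)
    pos = vals > 0
    scale = vals.sum() / vals[pos].sum()          # trace-preserving rescale
    vals_fixed = np.where(pos, vals * scale, 0.0)
    return (vecs * vals_fixed) @ vecs.T, scale
```

For instance, eigenvalues (2, 1, -0.1) give a rescaling of about -3%, comparable in size to the value reported for this LIM configuration in section 4.2.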

Multivariate LIM Validation
We now assess the performance of the multivariate LIM calibrated on annual CCSM4 past1000 data for deterministic and ensemble forecasts as well as spectral characteristics. We calculate the forecast skill by initializing forecasts from GCM data at all available times (using equation (4)) and then validate those forecasts by comparing to the reference GCM data at the specified lead times. For stochastic integration used in the ensemble forecasts and free-running LIM, we use a time step of Δt = 3 hr. In all cases, we use the full untruncated reference data from the GCM simulations for validation and comparison.

Deterministic Forecast Skill
We measure forecast skill using correlation and coefficient of efficiency (CE; Nash & Sutcliffe, 1970) tests. A perfect forecast gives a CE value of one, while a value of zero indicates errors at the level of a no-information forecast (e.g., climatology; see Appendix B for skill metric descriptions). Forecast experiments performed on the CCSM4 past1000 data represent the LIM skill in approximating the GCM, while tests on out-of-sample data (CCSM4 piControl and MPI past1000) ascertain the generality of the approximation. The MPI past1000 experiment is a harder test in that this model possesses different mean-state structure and climate variability than the calibration model. We also compare skill to a "damped persistence" forecast (univariate AR1 model) as a measure of whether the LIM adds useful information beyond persistence. We determine the AR1 forecast coefficients from the lagged autocorrelation (lag = lead time) of the reference GCM data.
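The two skill metrics and the damped-persistence baseline can be sketched as follows; a minimal illustration under the definitions above (the CE formula follows Nash & Sutcliffe, 1970; function names are ours):

```python
import numpy as np

def coefficient_of_efficiency(forecast, verification):
    """Nash-Sutcliffe CE: 1 - MSE / variance of the verification.

    CE = 1 for a perfect forecast; CE = 0 when errors match those of
    a climatology (mean) forecast.
    """
    err = np.mean((forecast - verification) ** 2)
    var = np.mean((verification - verification.mean()) ** 2)
    return 1.0 - err / var

def damped_persistence(series, lead):
    """AR1 forecast: damp the initial state by the lag-`lead` autocorrelation.

    Returns forecasts aligned with the verification series[lead:].
    """
    a = np.corrcoef(series[:-lead], series[lead:])[0, 1]
    return a * series[:-lead]
```

A forecast equal to the verification yields CE = 1, while forecasting the (zero) climatological mean of a zero-mean series yields CE = 0.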
The forecast skill of the CCSM4 past1000 case (Figure 2a) shows positive skill with significant correlations up to 10 years for global average SST and up to 4 years for dynamical indices. Predictive skill is substantial for short leads and outperforms AR1 forecasts for leads of at least 3 years. This performance suggests that the LIM adds useful dynamical information for shorter timescales and captures similar memory on longer timescales as the AR1 model. CE values for the dynamical indices reach zero around 4 years, which suggests that predictive skill is lost at that point. AR1 forecast skill at leads of 2 and 4 years for the Niño 3.4 index shows the regular oscillation timescale of ENSO in CCSM4. For a 2-year forecast, the AR1 skill is nearly as good as the LIM forecast. Overall, the NPI skill score is the lowest of the indices considered (correlation of 0.51 and CE of 0.26 at 1-year lead) but shows that this LIM captures predictive information even for noisy fields like SLP.
We now consider LIM skill for independent forecast data. The CCSM4 piControl forecast skill (Figure 2b) shows some reductions in the global average SST verification score relative to the calibration data. For 1-year forecasts, correlation decreases from 0.82 to 0.69 and CE from 0.67 to 0.43, or 16% and 36%, respectively. However, the AR1 forecast skill between the past1000 and piControl experiments drops 46% in correlation and 71% in CE, so the loss in persistence-only information is larger than the dropoff in LIM predictable dynamics. In contrast, the dynamical indices see little change between the past1000 and piControl CCSM4 forecast skill, suggesting that dynamics related to forcing in the calibration data (primarily volcanic aerosols) do not play a large role in the predictable dynamics of these features. Longer-term predictive skill does drop beyond 6 years for global average SST, which probably derives from the lack of forcing, which can impart globally coherent SST changes.
Finally, when forecasting on the out-of-sample data from a different GCM (MPI past1000; Figure 2c), a similar story is evident. The forecasts on MPI data show global average SST skill at long leads (significant correlation up to 6 years) and positive skill for leads of 2-3 years for the PDO and Niño 3.4 indices. The NPI skill, which is the lowest for all cases, only has some predictive skill for correlation at a 1-year lead. Global average SST and PDO skill tend to be closer to AR1 skill, suggesting that skill is more related to memory than dynamical interactions, but the AR1 skill values in the MPI case are also slightly higher than in the CCSM4 past1000 case for short lead times. Regardless, the ability to skillfully predict states from a transient simulation independent of the calibration procedure suggests that the LIM captures essential large-scale dynamics.

Figure 3 illustrates the field correlations (TAS, PR, ZG500, and SLP) for 1-year forecasts on the same in- and out-of-sample tests as done with the scalar indices. Again, the spatial forecast skill for the non-independent CCSM4 past1000 data (column a) shows the highest correlation values. The TAS field shows high correlation over the oceans, with noticeable maxima in regions related to ENSO and the PDO, and lower correlations over continental regions. The PR field displays higher skill in the tropical Pacific and Indo-Pacific region and generally low but still weakly positive correlations over the rest of the globe. High skill in the tropical region of the model is likely related to predictable shifts in the intertropical convergence zone connected to ENSO. Both ZG500 and SLP show high bands of correlation in the tropical regions as well, with some increased correlations at higher latitudes presumably related to tropical teleconnections.
Compared to SLP, the ZG500 field has more positive correlations with 1-year forecasts outside of the tropics, but both show lower skill over the southern Atlantic and Indian Oceans, Northern Hemisphere continents, and the North Atlantic. Within the tropics, the SLP field has higher correlations over the tropical Pacific Ocean and the Indian Ocean, with lower skill displayed in the Atlantic and over portions of Africa.
For the independent forecast skill tests on CCSM4 piControl and MPI past1000 data (Figure 3), we find many of the same high-correlation features. The piControl 1-year forecast correlations (Figure 3b) compared with the CCSM4 past1000 correlations highlight regions of the tropics and Pacific Ocean. Areas over Northern Hemisphere continents, Antarctica, and subtropical areas have lower skill than in the past1000 case, which, again, suggests that climate forcing in the calibration data set (primarily from volcanic aerosols) provides large-scale coherent variability that increases the skill in these areas. The MPI past1000 spatial skill (Figure 3c) reinforces the idea that some linearly predictable dynamics are shared between the two models and that the LIM successfully captures them. The TAS correlation shows somewhat lower skill compared to the CCSM4 past1000 case but is generally positive outside of the Southern Ocean region. The PR and SLP fields are the least skillful, with the LIM showing weakly positive correlations in the tropical Pacific and Indo-Pacific areas and some areas of anticorrelation in regions where dynamical responses between the models may be shifted or operate on different timescales. The skill correlation for the MPI ZG500 1-year forecasts shows the same tropical belt of high correlation as in the calibration data set, but the teleconnection patterns differ in shape from those in the CCSM4 forecast skill fields. Skill maps for ZOS and SST (Figure S6) show similar 1-year forecast patterns, with high skill in the tropics and mostly positive skill across forecasting experiments; one exception is the regions of anticorrelation in the subtropics for the MPI past1000 experiment. Overall, the ability to capture teleconnections and general skill in the tropics shows that robust dynamical relations from one model can be retained to a certain degree when forecasting on out-of-sample data, even from another GCM. Furthermore, the spatial field forecast correlation shows that the LIM approximates well many of the region-specific predictable dynamics of a climate model.

Spectral Characteristics
Another aspect of the LIM validation is the representation of the free-running statistics for model fields. A useful approximation should capture temporal variability and the system's memory in a way that reproduces the GCM's spectral power across timescales. We show the spectral comparison for selected scalars between the LIM and the reference data (CCSM4 past1000) in Figure 4. The LIM spectra are an aggregation of 250 iterations of a 1,000-year stochastic integration of the LIM using equations (6) and (7). We run each integration for 50 years prior to sampling the 1,000-year integration to remove the memory of initial conditions. From the 250 sample simulations, we calculate scalar indices from the fields, estimate the power density of each index using the multi-taper method (Prieto et al., 2009), average the estimates, and compute the 95% confidence bounds.
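A simplified multitaper power-density estimate can be sketched as follows. This is a basic stand-in for the Prieto et al. (2009) estimator used in the paper (it averages eigenspectra over DPSS tapers without adaptive weighting), with bandwidth and taper-count values chosen for illustration:

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_psd(x, dt=1.0, nw=4, k=7):
    """Average tapered periodograms over k DPSS (Slepian) tapers.

    x: 1-D series; nw: time-bandwidth product; returns (freqs, psd).
    """
    n = x.size
    tapers = dpss(n, nw, k)                       # (k, n) orthogonal tapers
    spec = np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2
    freqs = np.fft.rfftfreq(n, dt)
    return freqs, spec.mean(axis=0) * dt          # average over tapers
```

Applied to a pure sinusoid, the estimate should peak at the sinusoid's frequency, broadened by the chosen bandwidth.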
Overall, the spectra (Figure 4) retain the main spectral structure of the CCSM4 past1000 data, and the 95% range generally encompasses the noisier power estimate from the single realization of the CCSM4. The global average SST and PDO power density curves show increasing power out to 20- to 50-year periods with an ENSO-related bump around 3-5 years. The Niño 3.4 and NPI indices show power maximizing in the 3- to 7-year period band. This confirms that the LIM captures both local and remote influences of important dynamical features within the CCSM4 model (e.g., ENSO and the PDO) for multiple fields. As a reminder, the LIM simulates the ENSO-related spectral peak through constructive interactions between eigenmodes of L, which are excited by the stochastic forcing. Relative to the CCSM4 past1000 SST power curve, the LIM has more power at the shortest periods (<3 years) and less at the longest periods (>100 years), and the NPI variability appears to be low for short periods. Ideally, the LIM should capture the correct power at all timescales, but system memory is often imperfectly determined by the dynamical operator (L) due to nonlinearities and sampling error. Furthermore, the rescaling of Q due to negative eigenvalues can flatten the spectral curve, especially if the rescaling is large (not shown). For this LIM configuration, Q had two negative eigenvalues, which resulted in a small rescaling of the remaining 23 noise modes by 3%.

Ensemble Forecast Skill
For PDA applications, the skill of the ensemble mean forecasts and the ensemble spread are key factors that influence the results of assimilating proxy data. The ensemble spread is a measure of uncertainty for the forecast information, and if it is too small, PDA underweights useful information from proxies. For a LIM, the noise fit of the linear system is the determining factor for the ensemble spread when performing stochastic integration. While the LIM calibration procedure guarantees that the covariance of the truncated forecast basis is retained, it does not guarantee that our forecast uncertainty is representative for the full untruncated model fields. Additionally, nonlinearities, the blending of forced responses and system dynamics, and noise are all effectively parameterized by Q. We assess the ensemble characteristics for 1-year 100-member ensemble forecasts initialized from CCSM4 past1000 and piControl data. We use the ensemble calibration ratio (ECR), a measure comparing the average ensemble forecast errors to the average ensemble spread. A "well-calibrated" ensemble has a ratio near one (errors and spread are of comparable magnitude), while larger ratios signify an "under-dispersive" ensemble and ratios less than one signify an "over-dispersive" ensemble (see Appendix B for details). The similar calibration ratios of the dynamical indices between experiments further suggest a minimal role of external forcing in determining the predictable behavior for ENSO and the PDO. However, the change of global average TAS and SST from slightly under-dispersive to over-dispersive between the two forecasting experiments is large and likely related to variance differences in the reference data and volcanic aerosol forcing. First, forcing in the past1000 experiment provides a wider range of climate variability for the LIM to fit.
Therefore, an integration of that LIM with noise forcing produces a wider envelope of possible states than one fit on the piControl data (not shown). The larger ensemble spread of a LIM fit to past1000 data outweighs the small increase in forecast error for piControl data suggested by Figure 2, which decreases the calibration ratio. Furthermore, the effect of volcanic forcing in the past1000 experiment increases the forecast errors relative to natural variability. For example, the LIM cannot predict volcanic events, so it will miss the associated sharp cooling in temperature if a volcanic event occurs during a forecast year. Consistently missing these cooling events skews the mean squared error distribution, which increases the calibration ratio in the past1000 case.
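The ensemble calibration ratio can be sketched as below; a minimal illustration comparing mean squared ensemble-mean error to mean ensemble variance, consistent with the qualitative definition above (the exact Appendix B formula may differ in detail):

```python
import numpy as np

def ensemble_calibration_ratio(ensembles, truth):
    """ECR: mean squared ensemble-mean error over mean ensemble variance.

    ensembles: (n_forecasts, n_members); truth: (n_forecasts,).
    Values near 1 indicate a well-calibrated ensemble; > 1 is
    under-dispersive, < 1 is over-dispersive.
    """
    err = np.mean((ensembles.mean(axis=1) - truth) ** 2)
    spread = np.mean(ensembles.var(axis=1, ddof=1))
    return err / spread
```

When the verification is statistically indistinguishable from the ensemble members (truth and members drawn from the same distribution about a shared signal), the ratio should sit near one.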

Experiments With 0-700 m OHC
Ocean circulation and heat storage are crucial aspects governing long-term climate variability and climate change. However, the challenging nature of observing ocean characteristics at depth and with broad coverage limits reliable instrumental estimates of ocean behavior to the recent past (e.g., Riser et al., 2016). Ocean reanalysis projects use high-resolution ocean models to blend full-field model information with the instrumental record but are similarly limited in temporal extent (Balmaseda et al., 2013;Carton & Giese, 2008;Chang et al., 2013). Two recent studies use passive ocean transport estimations and instrumental and proxy-based SST fields to constrain OHC over an extended period (Gebbie & Huybers, 2019;Zanna et al., 2019). Ensemble PDA is another method to estimate historical ocean heat storage, so anticipating future research on this topic, we now investigate the inclusion of 0-700 m OHC into a LIM.
When including 0-700 m OHC in the multivariate EOF truncation procedure described previously, we find some changes to the decay timescales of the dynamical operator's forecast modes. The largest change is an increase in the e-folding decay timescale (EFT) of the first LIM mode from around 4 years to more than 10 years (Figure 5, Mvar. OHC700 vs. No OHC700). There are additional modes with smaller increases in EFT, but the decay timescale dropoff for higher forecast modes is similar with the inclusion of OHC in the multivariate EOF (hereafter referred to as mvarOHC) and without. For comparison, to gauge the potential predictability of 0-700 m OHC in isolation, we form a LIM calibrated on the leading EOFs of the OHC field alone and test global average OHC forecast skill across varying numbers of retained modes. For the OHC-only LIM, the 1-year forecast skill of the global average OHC with 25 OHC EOFs shows a correlation of 0.9 and a CE of 0.8 when forecasting on the independent piControl data (Figure 6). In contrast, the forecast skill for global average 0-700 m OHC using the mvarOHC LIM reaches a correlation of 0.6 and a CE of 0.2 with 25 multivariate modes retained. The skill disparity suggests that the multivariate EOF truncation removes information essential to OHC forecasts.
The compactness of the multivariate EOF works well for fields that share a significant amount of their predictable information with the leading modes of common large-scale variability. When a field, such as 0-700 m OHC, does not fall into this category, it can be isolated from the multivariate EOF truncation step and directly added to the state vector. This explicit field component representation is common in other LIM-based studies (e.g., Ault et al., 2018; Newman et al., 2011). As such, we remove the 0-700 m OHC from the truncation procedure for the multivariate EOF state and, instead, append the leading 20 OHC EOFs to the multivariate components (equation (A4)). In this configuration, the EFT of the leading forecast mode increases to approximately 21 years (Figure 5, Separated OHC700), and the first four modes have EFTs above 5 years. The skill of global average 0-700 m OHC for 1-year forecasts on CCSM4 piControl data reaches a correlation of 0.9 and a CE of 0.8, matching the OHC-only LIM. For other fields, the addition of OHC in this manner does not significantly affect skill at longer leads (Figure S7). There are very slight increases at extended leads for scalar indices in CCSM4 past1000 forecasts, but no similar increase in the piControl experiments. An examination of the SST field correlations from the past1000 forecasts highlights the North Atlantic as a region that especially benefits from the inclusion of explicit OHC information within the LIM (not shown). Another important improvement from representing the OHC field independently in the LIM basis is apparent in the spectral power of the global average 0-700 m OHC (Figure 7). When OHC is included in the multivariate EOF, the free-running LIM cannot replicate the base model's variability at long timescales (periods longer than 50 to 100 years).
The difference between the LIM ensemble mean power (10^17 [J m^-2]^2/year) and the CCSM4 past1000 data (10^18 [J m^-2]^2/year) is an order of magnitude. Separately representing OHC in the LIM basis approximately halves the power difference at longer timescales and gives a much better visual approximation to the spectral power of CCSM4 OHC. Both cases do, however, overestimate power at high frequencies (periods of 2 to 3 years), which could be due to difficulties in properly fitting noise characteristics for the slowly varying OHC field and the Q rescaling (experiment rescalings of −3% for the mvarOHC case and −23% for the separated OHC case). Regardless, free-running statistics are an important measure of model approximation, and by separating OHC in the state, we obtain a better approximation of the target model.
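The low-frequency power comparison can be illustrated with a standard spectral estimate. The sketch below is purely hypothetical: two AR(1) series stand in for the base model and the free-running LIM output, and Welch's method compares power at periods longer than 50 years (frequencies below 0.02 cycles/year):

```python
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(1)

def ar1(phi, n, sigma=1.0):
    """Generate a toy AR(1) series as a stand-in for an annual OHC index."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + sigma * rng.standard_normal()
    return x

# Hypothetical stand-ins: a high-memory "GCM" series, a lower-memory "LIM" series
gcm = ar1(0.95, 1000)   # 1000 "years" at annual resolution
lim = ar1(0.70, 1000)

# fs = 1 cycle/year; periods > 50 years correspond to f < 0.02 cycles/year
f, P_gcm = welch(gcm, fs=1.0, nperseg=256)
_, P_lim = welch(lim, fs=1.0, nperseg=256)
low = (f > 0) & (f < 0.02)
print(P_gcm[low].mean() / P_lim[low].mean())  # >1: "LIM" underestimates low-freq power
```

A ratio well above one in the low-frequency band is the toy analog of the order-of-magnitude power deficit described above.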

Concluding Remarks
Here we explored the configuration and use of a LIM as an approximation of climate forecasts from a coupled climate model. The simplicity of a LIM's representation of predictable dynamics and noise provides a skillful and efficient model for running long-term ensembles. The LIM skillfully predicts aggregate quantities and dynamical indices out to multiyear lead times and is consistent with other studies finding decadal predictability in GCM ocean fields (Branstator et al., 2012; Hawkins & Sutton, 2009). The 1-year LIM forecasts show high skill for scalar quantities (correlation and CE above 0.5 for all SST measures) and capture local and remote predictability through dynamical teleconnections. We find that all 1-year forecasts outperform the benchmark of a damped persistence (AR1) forecast, which confirms that the LIM captures useful information about multivariate dynamics beyond persistence. The margin over damped persistence is smallest when forecasting the state of a different climate model (the CCSM4-calibrated LIM forecasting output from the MPI model). Forecasts on the MPI data represent a realistic and difficult test where the modeled dynamics are imperfect due to different model biases and different underlying model behaviors. Despite this, the LIM captures some shared linearly predictable aspects of the climate between models, including teleconnections to a certain degree. The ability to capture predictable information and approximate statistical multivariate properties of a coupled climate system showcases the utility and value of a LIM as a forecasting tool.
Journal of Advances in Modeling Earth Systems 10.1029/2019MS001778

For climate reanalysis experiments using data assimilation, the weighting of observations and the model-estimated state depends critically on the ensemble variance, a measure of forecast uncertainty. Too little variance in the ensemble forecasts leads to a situation where observational information may be severely underutilized. Ensemble calibration ratios of 1-year ensemble LIM forecasts reveal that forecasts are generally close to the ideal calibration ratio but are somewhat under-dispersive. The mean ensemble errors are between 10% and 70% larger than the ensemble variance when forecasting on CCSM4 past1000 data. There is a notable discrepancy in the global average temperature calibration ratio between the two forecasting experiments (CCSM4 past1000 and piControl): the unforced reference data suggest that the ensemble is over-dispersive, while the forced reference data suggest the opposite. This difference is partly due to the smaller variance of the piControl simulation and the unpredictable volcanic forcing of the past1000 simulation. The under-dispersive forecasts for global aggregate and dynamic indices in the past1000 case suggest that the ensemble variance may need slight adjustment for PDA reanalysis experiments.
The method of LIM construction using the two-step multivariate EOF decomposition provides an easy algorithm for including other surface and atmospheric fields, even those with partially redundant information. This type of basis also prevents the rapid expansion of state length, which could quickly move beyond the number of well-determined forecast modes of the LIM's dynamical operator, L. However, based on our attempt at including 0-700 m OHC using a multivariate EOF representation, this method is not ideal for all cases. Aspects of OHC predictability and the spectral power approximation degrade due to the loss of information from EOF truncation. Instead, we find that separately including the OHC in the state produces a better approximation of OHC variability and better forecast skill with a correlation increase of 0.3 and a CE increase of 0.6. The explicit inclusion of field components in the state reduces the truncation error of that field, but the trade-off, in this case, is a larger LIM basis. We use the 0-700 m OHC field as an example here, but a simple extension could involve other fields or groups of fields (e.g., separately grouped and reduced atmosphere and ocean components or the inclusion of ice reservoir fields).
This study, motivated by previous univariate climate reanalysis work using a LIM (Perkins & Hakim, 2017), extends the LIM methodology for general multivariate fields and coupled ensemble forecasts. The results here are broadly applicable to any project seeking to generate climate model constraints without having to run a GCM. For example, one could fit multiple LIMs to various GCMs to get an estimate of the robust predictable response to a given state or to generate a super-ensemble of climate possibilities. The addition of climate forecasts into PDA frameworks will incorporate constraints based on a LIM's encoded memory and dynamics into the reconstructed fields. Forthcoming work will detail the implementation of this forecast model into our PDA framework and explore the results of an "online" coupled atmosphere-ocean reconstruction over the last millennium.

Appendix A: Multivariate Empirical Orthogonal Function Reduction
The predictability encapsulated in a LIM fit to climate data is typically dominated by the primary modes of Earth system variability. Therefore, we can use empirical orthogonal functions (EOFs) to represent the calibration data in a compact space that retains these predictable features. Here, we use a two-step EOF reduction to reduce the computational expense of the EOF compression across multiple climate fields. We describe this reduction process using a two-field example below.
For two climate fields, X_A and X_O, each with dimensions of M × N (M is a flattened spatial dimension, and N is the number of times), we first latitude weight each grid cell using x_w,i = sqrt(cos φ_i) x_i, where x_i is a gridpoint time series and φ_i is the corresponding latitude. We then determine the field EOFs, U, using singular value decomposition (SVD; X_w = U Σ V^T), truncate the columns of U to the leading K modes, and project the data into that basis (e.g., X̂_A = U_A^T X_A). We choose K = 400 in this work so that projected data fields generally retain more than 90% of the total variance in this initial compression step.
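A minimal sketch of this first compression step, using a random field in place of real calibration data (grid sizes, K, and all values are hypothetical; random data lack the low-rank structure of climate fields, so the retained variance here is far below the 90% quoted above):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical gridded field: M grid points (flattened lat-lon) x N times,
# stored latitude-major (each latitude's longitudes are contiguous rows)
nlat, nlon, N = 12, 24, 300
lats = np.linspace(-87.5, 87.5, nlat)
X = rng.standard_normal((nlat * nlon, N))

# Latitude weighting: scale each gridpoint series by sqrt(cos(lat))
w = np.sqrt(np.cos(np.deg2rad(lats)))            # one weight per latitude
Xw = X * np.repeat(w, nlon)[:, None]             # broadcast over longitudes

# Field EOFs via SVD, truncated to the leading K modes
K = 20
U, s, Vt = np.linalg.svd(Xw, full_matrices=False)
U_K = U[:, :K]
X_hat = U_K.T @ Xw                               # project data into the EOF basis

# Fraction of total variance retained by the K leading modes
retained = (s[:K] ** 2).sum() / (s ** 2).sum()
print(X_hat.shape, retained)
```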
After the first EOF compression, we standardize the fields so that the total component variance of each field sums to one; for example, X̃_A = X̂_A / sqrt(Σ_k var(x̂_A,k)), where the sum runs over the K retained components of X̂_A. This ensures that the relative magnitudes of the variance for individual field components are retained within a field but also that components across fields are considered equally in the next compression step. We concatenate the EOF-reduced and standardized fields to form the new state space (dimensions of 2K × N in this example), Y = [X̃_A; X̃_O].

Finally, we perform an SVD on Y to determine the multivariate EOFs, U_Y, retaining the leading L modes. We then project the state into this multivariate EOF space using Ŷ = U_Y^T Y. With this, the state is fully compressed into the basis used for LIM calibration and forecasting. All EOFs (e.g., U_A, U_O, and U_Y) are used to project the same fields from other sources into the LIM basis for forecasting experiments.
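The standardization and second-step SVD can be sketched as follows, with random matrices standing in for the first-step EOF components (all sizes and values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

K, N, Lm = 30, 400, 15  # modes per field, times, multivariate modes retained

# Hypothetical first-step EOF components (K x N each) with unequal magnitudes
Xa = rng.standard_normal((K, N)) * 5.0   # e.g., atmosphere components
Xo = rng.standard_normal((K, N)) * 0.5   # e.g., ocean components

def standardize(Xhat):
    """Scale a field's components so their variances sum to one."""
    total_var = Xhat.var(axis=1, ddof=1).sum()
    return Xhat / np.sqrt(total_var)

# Concatenate the standardized fields into the joint state (2K x N)
Y = np.vstack([standardize(Xa), standardize(Xo)])

# Second-step (multivariate) EOFs, truncated to the leading Lm modes
U_Y, s, _ = np.linalg.svd(Y, full_matrices=False)
Y_hat = U_Y[:, :Lm].T @ Y                # state used for LIM calibration
print(Y.shape, Y_hat.shape)
```

The per-field scaling is what lets the 5.0-amplitude and 0.5-amplitude fields contribute equally to the joint SVD.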
In the hybrid multivariate EOF and field component state experiments related to the 0-700 m OHC field described in section 5, we include the OHC components in the state by concatenating them to the end of the multivariate state,

Ẑ = [Ŷ; X̂_OHC],  (A4)

where X̂_OHC contains the leading 20 OHC EOF components.

Appendix B: Forecast Skill Metrics
In the forecasting experiments, we validate a series of forecasts, f_t, against the original reference data from a climate model, v_t, for all available times (t = 1, …, T). For deterministic forecasts (equation (4)), we use the Pearson correlation coefficient and the coefficient of efficiency (CE; Nash & Sutcliffe, 1970) as validation metrics for field indices and grid cell data. The correlation,

r = Σ_t (f_t − f̄)(v_t − v̄) / (T σ_f σ_v),  (B1)

gives a measure of how well signal timing matches the reference data. The overbar (e.g., v̄ and f̄) signifies a temporal average, and σ represents the series standard deviation.
CE,

CE = 1 − Σ_t (f_t − v_t)² / Σ_t (v_t − v̄)²,  (B2)

is a much stricter metric that is sensitive to signal phase, amplitude, and bias mismatches between the forecast and reference data. CE is unbounded in the negative direction; a value of zero indicates that errors are of the same magnitude as the series climatological variance, and a value of one signifies a perfect fit.
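Both deterministic metrics are straightforward to implement; the sketch below follows equations (B1) and (B2) with toy data (the helper names are illustrative, not from the paper):

```python
import numpy as np

def correlation(f, v):
    """Pearson correlation between forecast f and reference v (equation (B1))."""
    fa, va = f - f.mean(), v - v.mean()
    return (fa * va).sum() / (len(f) * f.std() * v.std())

def coefficient_of_efficiency(f, v):
    """Nash-Sutcliffe CE (equation (B2)): 1 minus error sum over reference variance."""
    return 1.0 - ((f - v) ** 2).sum() / ((v - v.mean()) ** 2).sum()

# Toy check: a forecast equal to the reference scores r = CE = 1, while a
# biased forecast keeps r = 1 but degrades CE
rng = np.random.default_rng(5)
v = rng.standard_normal(200)
print(correlation(v, v), coefficient_of_efficiency(v, v))
print(correlation(v + 1.0, v), coefficient_of_efficiency(v + 1.0, v))
```

The bias example shows why CE is the stricter metric: correlation ignores a constant offset that CE penalizes.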
For ensemble forecasts, we use the ensemble calibration ratio (ECR) to compare the ensemble forecast errors and the ensemble spread. For a probabilistic forecast (at time t) represented by an ensemble (g_t) with E members, the ensemble average squared error (SE) is defined as

SE_t = (ḡ_t − v_t)²,

where ḡ_t signifies the ensemble average at time t, and the ensemble variance is defined as

Var_t = (1 / (E − 1)) Σ_i (g_ti − ḡ_t)²,

where g_ti is the value of ensemble member i at time t. Using both terms, the forecast ECR is given by the ratio of the time-mean SE to the time-mean ensemble variance. This ratio illustrates whether the probabilistic forecasts are on average "well calibrated"; that is, whether the forecast errors are representative of the forecast ensemble spread (model uncertainty). If ECR > 1, then forecast errors are larger than the uncertainty, and the model is considered "under-dispersive." If ECR < 1, then the modeled errors are smaller than the spread, and the model is considered "over-dispersive." Information based on the ECR can be used to tune the forecast ensemble variance for data assimilation applications.
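A sketch of the ECR computation, using a hypothetical ensemble layout of (times × members) and synthetic data; a truth series drawn from the same distribution as the members yields an ECR near one, while shrinking the ensemble spread pushes ECR above one (under-dispersive):

```python
import numpy as np

def ensemble_calibration_ratio(G, v):
    """ECR: time-mean squared error of the ensemble mean over time-mean ensemble variance.

    G: ensemble forecasts, shape (T, E); v: verification series, shape (T,).
    """
    se = (G.mean(axis=1) - v) ** 2        # squared error of the ensemble mean
    var = G.var(axis=1, ddof=1)           # ensemble variance at each time
    return se.mean() / var.mean()

rng = np.random.default_rng(6)
T, E = 500, 50
signal = rng.standard_normal(T)                          # hypothetical predictable part
members = signal[:, None] + rng.standard_normal((T, E))  # members scatter around it
truth = signal + rng.standard_normal(T)                  # truth drawn like a member

ecr = ensemble_calibration_ratio(members, truth)         # near 1: well calibrated
shrunk = signal[:, None] + 0.5 * rng.standard_normal((T, E))
under = ensemble_calibration_ratio(shrunk, truth)        # > 1: under-dispersive
print(ecr, under)
```

The second case mimics the past1000 results above, where errors exceed spread and the ensemble variance would need inflation before assimilation.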