A Multifidelity Framework and Uncertainty Quantification for Sea Surface Temperature in the Massachusetts and Cape Cod Bays

We present a multifidelity framework to analyze and hindcast predictions of sea surface temperature (SST) in the Massachusetts and Cape Cod Bays, which is a critical area for its ecological significance, sustaining fisheries and the blue economy of the region. Currently, there is a lack of accurate and continuous SST prediction for this region due to the high cost of collecting the samples (e.g., cost of buoys, maintenance, severe weather). In this work, we use SST data from satellite images and in situ measurements collected by the Massachusetts Water Resources Authority to develop multifidelity forecasting models. This multifidelity framework is based on autoregressive Gaussian process schemes that systematically exploit all correlations between data from multiple heterogeneous spatiotemporal sources with various degrees of fidelity. This enables us to obtain implicitly their functional relationships and, at the same time, quantify the uncertainty of the data‐driven predictions. Specifically, in the current work, we develop and validate progressively more complex models, including temporal, spatial, and spatiotemporal multifidelity hindcast predictions of SST in the Massachusetts and Cape Cod Bays. Together with these predictions, we present for the first time uncertainty maps for the region.


Introduction
In the Gulf of Maine, Massachusetts and Cape Cod Bays encompass an area of highly productive fisheries, aquaculture facilities, and ecological resources that provide services to coastal communities. In this area, the Stellwagen Bank National Marine Sanctuary is of particular relevance by supporting strong local economies through commercial and recreational fisheries and providing more than 80% of the whale watching tourism in New England (https://sanctuaries.noaa.gov/science/socioeconomic/ factsheets/stellwagenbank.html). The future of these benefits is uncertain as sea surface temperatures in the Gulf of Maine are increasing at one of the highest rates in the world's oceans, with clear consequences in reducing Atlantic cod recruitment that contributed to overfished stocks (Pershing et al., 2015). Similarly, in New England, the lobster industry has migrated almost entirely into Maine, being no longer supported in southern coastal states as in the past (Steneck et al., 2013), while about 40 other fish and invertebrate species are considered as highly vulnerable to climate change and its variability (Hare et al., 2016). In the summer of 2018, sea surface temperature (SST) anomalies were 1 to 3 • C above the 30-year average in the Gulf of Maine. These anomalies and ocean heat waves are becoming common in the Gulf of Maine. On 8 August 2018, the SSTs measured were the second warmest ever observed there (https://www.drought.gov/drought/sites/drought.gov.drought/files/media/reports/regional_outlooks/GOM %20Summer%202018.pdf). In view of these changes and the likely scenario they would increase in the near future, more accurate and timely predictions of sea temperature at regional and local scales might better help resource management and economic planning.
The predictability of environmental drivers is a key step in the process of linking climate and biological processes for better management of marine resources. Advances in climate prediction facilitate this management, with a particular focus on dynamical seasonal to decadal prediction systems derived from Global Climate Models (Tommasi et al., 2017). Considerable progress has been made with this modeling approach, but some limitations exist in coastal predictions at regional scales, particularly in the U.S. east coast (Stock 10.1029/2019EA000954 et al., 2015. In addition, in situ data are relatively scarce in this region from buoys, but the Massachusetts Water Resources Authority (MWRA) sustains a comprehensive monitoring program for a wide range of bay conditions (Werme et al., 2017). The MWRA measurements provide an accurate but coarse space-time measurements of various quantities of interest in the Massachusetts and Cape Cod Bays, including seawater temperature. Finally, satellite measurements have better temporal resolution but they are gappy due to cloud coverage and require calibration. Therefore, by using less conventional approaches as the one presented in this study, we aim to advance the predictability of SST in the region by using all available sources of data of variable fidelity.
Various data assimilation methods have been used to combine satellite data and in situ measurements. Among these methods, optimal interpolation (OI) is the most widely used data assimilation technique for ocean and meteorological application (see e.g., Daley, 1992). Most weather centers used OI for weather forecasts during the 1970s and 1980s. OI is also widely used for combining satellite data with in situ measurements (see e.g., ocean group for high resolution SST products; Dash et al., 2012). OI combines the background measurements u b and observations u o via the analysis equation given by: where H is the observation operator that interpolates from background field to observation space, and P and R are background and observation covariances. The interpolator H is often chosen a priori, and the effect of its misspecification must be accounted for (Liu & Rabier, 2006). For sparse background measurements and observations available at different scales, determining P, R, and H is nontrivial, and this has been the subject of intense research (see e.g., Oke & Sakov, 2008 and references therein). In addition to OI, variational methods are also used for the purpose of data assimilation. In these methods, a cost function is defined that seeks to find a smooth field while penalizing the distance between the field and the observations. In Li et al. (2015), the 4D variational data assimilation algorithm was used to combine simulation predictions with in situ measurements in Gulf of Maine with very good results. Empirical orthogonal functions have also been used for field reconstructions from noisy measurements (Yang et al., 2010) and SST data assimilation (Smith et al., 1996). In Smith et al. (1996), the empirical orthogonal function modes were computed from OI-analyzed SST fields, and in situ measurements were used to determine the time dependence of each mode. More recently, other techniques have been proposed to blend satellite data with in situ measurements and to fill in the gaps in sattelite-derived SST products, including hierarchical Bayesian algorithms (Zhu et al., 2017(Zhu et al., , 2019, Bayesian maximum entropy (Li et al., 2013), and multiresolution wavelet-based algorithms (Chin et al., 2017).
Gaussian processes (GP) are nonparametric Bayesian machine learning techniques for learning functions from data (Rasmussen & Williams, 2006). GP provides a flexible prior distribution over functions, enjoys analytical tractability, and has a probabilistic workflow that yields robust posterior estimates and handles uncertainty in a principled manner. Before conditioning on the training data, a GP model is completely specified via its mean function and covariance function, which is called kernel and determines how the GP model extrapolates and generalizes to new data. There are many choices for kernel. For example, linear regression, splines, Kalman filters, and OI are all examples of GPs with particular kernels (Reece & Roberts, 2010). The kernel of a GP is often parameterized and the hyper-parameters are learned from the observed data, for example, using the maximum likelihood. This allows for further fine-tuning the kernel to the training data. The training also determines hyper-parameters associated with the uncertainty model. Training the kernel and the uncertainty model replaces the guessing work needed in OI, in which various matrices and noise models must be determined a priori.
The novelty of our approach is based on the use of a multifidelity framework for obtaining SST, exploiting correlations in space-time for sea surface, and estimating both mean values as well as their uncertainty as surface maps in the Massachusetts and Cape Code Bays by combining in situ measurements with satellite data. This framework allows us to replace the usual line error bars by space-time error uncertainty maps. Our ultimate goal is to develop a regional model that could even monitor acidification, similar to the three regional aragonite models for the U.S. West Coast (Alin et al., 2018;Davis et al., 2018;Juranek et al., 2009), but continuously updated and endowed with uncertainty quantification and forecasting capabilities that will drive future measurements.  (Werdell et al., 2013). Right: Massachusetts Water Resource Authority (MWRA) in situ measurements shown in red. The NERACOOS buoy A01 is shown in blue, and it is used only for validation purposes.

Data Sources
For the estimation of SST in Massachusetts and Cape Cod Bays, we consider two sources of data: (1) High-Fidelity: The MWRA measurements. The MWRA routinely measures SST and various other quantities at fourteen stations in Massachusetts and Cape Cod Bays . These stations are shown in Figure 1 (right). The MWRA measurements are taken at various depths and on a monthly basis during nine surveys per year with the exception of winter months. In this manuscript, we use MWRA measurements taken at 1 m below the surface as an estimate for seawater surface temperature.
(2) Low Fidelity: The MODerate-resolution Imaging Spectroradiometer (MODIS) Terra on board NASA satellite with the spatial resolution of 4 × 4 km (Werdell et al., 2013). The satellite images are processed at level 3 with daily and  nightly temporal resolutions and are provided through the NASA Earth Science Data System (ESDS) project. It can be accessed via Physical Oceanography Distributed Active Archive Center (PO.DAAC). Throughout this manuscript, we will use the MODIS Terra, thermal-IR SST level 3, 4 km, daily. We consider the satellite data in the region that spans the longitude of 69 • 58 ′ W to 70 • 58 ′ W and the latitude of 41 • 41 ′ N to 42 • 36 ′ N. The satellite images are gappy due to cloud coverage as shown in Figure 1 (left). The observation counts of MODIS satellite for years 2015 and 2016 are shown in Figure 2. It is clear that during summer, there are significantly more satellite data available compared to late fall, entire winter, and early spring. Although the MODIS satellite data used in this study are level 3 SST products, that is, they are blended with in situ measurements when compared against MWRA measurements, the MODIS satellite measurements show up to 3 • C discrepancy. We expected that some of this disagreement resulted from comparing surface with subsurface measurements (MWRA data from 1 m deep), and some of it may come from the coarse satellite resolution 4 × 4 km. However, the correlation plot between the MWRA measurements and MODIS satellite measurements in the year 2015 and 2016 indicated that this bias is not consistent along the temperature interval ( Figure 3). Therefore, we use the MODIS satellite measurements as a low-fidelity data source. Overall, the satellite data provide regional-scale patterns while the MWRA measurements are accurate but scarce in situ measurements.

Multifidelity Modeling
In this section, we present the details of the multifidelity modeling based on GP regression (GPR; Rasmussen & Williams, 2006), which is a nonparametric Bayesian machine learning technique that enjoys analytic tractability and provides a fully probabilistic framework for approximating functions and treating approximation uncertainties in a principled manner.

Multifidelity Scheme
For the sake of brevity we present the linear formulation for two fidelity levels (Babaee et al., 2016;Kennedy & O'Hagan, 2000;Prempraneerach et al., 2017); however, the formulation can be extended to n > 2 levels of fidelity in a straightforward manner (see e.g., Forrester et al., 2007) and to nonlinear autoregressive schemes using deep GP (see Perdikaris et al., 2017). To this end, we consider the low-fidelity (L) and high-fidelity (H) observations to be of the form For temporal SST multifidelity models D = 1, where x represents time, for spatial models D = 2 with x representing the pair of longitude and latitude, and finally, for spatiotemporal models D = 3 with x representing the triplet of longitude, latitude and time. In equation (2), u L (x) and (x) are assumed to be two independent random fields, each represented by a GP in the form of and L and H represent the unbiased errors in the low-fidelity and high-fidelity observations, respectively. These errors may be caused by unresolved scales due to undersampling or measurement noise corruption, and they are modeled as a centered Gaussian noise L ∼  (0, 2 ). The primary source of noise in our high-fidelity/low-fidelity data, however, is the effect of unresolved scales, because of coarse spatial and temporal resolution of the training data. These subgrid scales are commonly modeled via isotropic Gaussian noise in ocean/atmospheric problems (see e.g., Daley, 1992, Chapter 4). This does not mean, however, that these errors are completely homogeneous. As reasoned in Daley (1992), the inhomogeneities in the noise are often on a very large scale, while the noise correlation lengths are much 10.1029/2019EA000954 smaller. Therefore, when the scale of the inhomogeneity is much larger than the scale of the correlation, the assumption of homogeneous Gaussian for the noise is not a major source of the modeling error.
The multifidelity scheme given by equation (2) tolerates and learns the systematic bias in the low-fidelity measurements. This bias is modeled by the scaling factor 1∕ and the bias correction of − (x, t)∕ . In cases where the high-fidelity data have systematic bias, for example, due to sensor malfunction, the uncertainty has to be modeled explicitly. We use the square-exponential kernel in most cases presented in this paper (for hindcast predictions) as specified by where and i are hyper-parameters. Therefore, the kernel at each fidelity level contains D + 1 hyper parameters, which in addition to the hyper-parameters introduced by n L , n H , and results in 2D + 5 hyper-parameters in total.

Training
The hyper-parameters are learned from the low-fidelity and high-fidelity observations. The low-fidelity observations are denoted by {x L , y L }, where x L is a n L × D input points and y L is a n L × 1, representing SST satellite measurements, and n L is the number of low-fidelity measurements. Similarly, the high-fidelity observations are denoted by {x H , y H }, where x H with the size of n H × D and y H with the size of n H × 1 are high-fidelity input points and (MWRA) measurements, respectively, and n H is the number of high-fidelity measurements.
Since u L (x) and (x) are independent GP, using the multifidelity scheme given by equation (2), it is straightforward to observe that the low-fidelity and high-fidelity observations have a joint Gaussian distribution given by where y is an n × 1 vector that contains the low-fidelity and high-fidelity observations at points with the size of n × D, n is the number of high-fidelity and low-fidelity data points, that is, n = n H + n L , and where k LL (x L , x L ) is an n L ×n L matrix that represents correlations between the low-fidelity points, k LH (x L , x H ) is an n L × n H matrix, representing the correlations between the low-fidelity points and high-fidelity points, k HH (x H , x L ) is an n H × n H matrix that represents the correlations between high-fidelity points, and The hyper-parameters ( L , , , n L , n H ) are obtained by minimizing the above function (i.e., maximizing the likelihood). The minimization problem is solved using the trust-region method. The kernel matrix K has a size of n × n and is a full matrix; therefore, it has a memory requirement of (n 2 ) and its inversion has a computational complexity of (n 3 ). Although we used this approach, for large data sets, this is prohibitive. To address this issue, a recursive autoregressive scheme has been suggested (Le Gratiet & Le Gratiet, 2014), in which the data sources of the two levels of fidelity can be decoupled and equivalently reformulated in a recursive two independent Gaussian regression problems. This requires that the high-fidelity and low-fidelity data sets are nested.

Prediction
Once the hyper-parameters are found, the probabilistic model can be used for the prediction of the mean values at new observed locations and times along with the associated uncertainty. These predictions can be computed for any arbitrary point, and they are not tied to the location of the training data, for example, to satellite grids. In that sense, the predictions of the multifidelity model are continuous. The resulting posterior distribution at the high-fidelity level is taken as the prediction of the multifidelity model. The probabilistic prediction at arbitrary points x * correspond to conditioning the joint Gaussian prior distribution on the low-fidelity/high-fidelity observations (y). To this end, we observe that the joint probability density function (PDF) of the new prediction points and the training points is also a Gaussian given by The above Gaussian relationship between the prediction points and high-fidelity/low-fidelity data can be used to find the conditional distribution to make predictions: where H = q T K −1 y is the posterior mean (i.e., predictions), the variance of the prediction is obtained by 2 H = diag(k HH (x * , x * ) − q T K −1 q), and ±2 H represents the uncertainty of the multifidelity predictions.

Comparison with Other Methods
In this section, we compare GPR and multifidelity modeling with other existing approaches. We focus on two essential aspects of the presented multifidelity formulation, namely, regression and data assimilation.

Regression
First, we compare GPR with polynomial regression. GPR is a nonparametric regression technique, in which no explicit basis function is assumed, unlike the polynomial regression, which is a parametric regression technique and requires the specification of the basis (monomials). To compare GPR and polynomial regression, we consider synthetic data generated by function u(x) = (6x − 2) 2 sin(12x − 4) corrupted by noise. The size of the training set is 500. This amounts to a single-fidelity regression as there is only one data source. In Figure 4, the results of GPR and polynomial regression of different orders (P = 5, 10, and 20) are shown. It is clear that P = 5 lacks the required model complexity to accurately discover u(x) and P = 20 is too complex. GPR, however, discovers u(x) accurately (see Rasmussen & Williams, 2006 for a comprehensive discussion on GPR).

Data Fusion
Second, we compare the multifidelity model with OI. In OI terminology, low-fidelity and high-fidelity data are referred to as background and observation, respectively, and the mean of the posterior is referred to as analysis. We consider two sources of data that are synthetically generated: (1) High-fidelity by 4 samples of function u H (x) = (6x − 2) 2 sin(12x − 4) and (2) low-fidelity by 21 samples of function u L (x) = 0.5u H (x) + 10(x − 0.5) − 5. The location of high-fidelity samples are chosen to be a subset of low-fidelity samples (same x values), which makes it unnecessary to specify the interpolator matrix H in equation (1). In Figure 5a, the single-fidelity GPR with the four high-fidelity samples are shown. Obviously, four samples of u H (x) are not sufficient for an accurate regression, which is reflected in large uncertainty bands. We assume a square-exponential kernel for OI background and observation noises: , where 2 and l c are the variance intensity and the correlation length, respectively. Therefore, P = k(x L , x L ; 2 P , l 2 c ) and R = k(x L , x L ; 2 R , l 2 c ). We consider P = R = 1. In Figures 5b and 5c, we compare the results of multifidelity model with OI with two different correlation lengths of l c = 1 and l c = 0.01. The multifidelity model recovers the true function with very good accuracy, which is reflected in small uncertainty. It is clear that the mean values of OI exhibit large discrepancy with the truth. This is mainly because in OI, both low-fidelity (background) and high-fidelity (observation) are assumed to be unbiased. This can be observed by subtracting the truth from both sides of equation (1) and obtain an equation for the error (for more details, see Daley, 1992). The correlation length affects the length of the region corrected by the high-fidelity observation. This is clearly demonstrated in Figures 5b and 5c, where l c = 1 widens the neighborhood corrected by the high-fidelity observation, while l c = 0.01 only improves the value of the assimilated data at the location of high-fidelity observation. On the other hand, the multifidelity model accounts for biased low-fidelity data and learns the bias (x) from the data, while the high-fidelity must be unbiased. Moreover, the multifidelity model learns the correlation length of the covariance matrices as hyper-parameters.

Demonstration Examples
In this section, we will demonstrate the application of the multifidelity framework in estimating SST in Massachusetts and Cape Cod Bays. In particular, we present results of three SST multifidelity models: (1) temporal, (2) spatial, and (3) spatiotemporal. For validation purposes, we split the MWRA measurements to two disjoint sets: train set that is used in model training and the test set that is used to validate the multifidelity predictions. In all of these demonstrations, we perform hindcast predictions.

Temporal Multifidelity Model of SST
With the proposed framework, we built a temporal multifidelity model for three MWRA stations: F02, F23, and N18 in the span of 2 years (2015-2016; see Figure 6d for the location of these stations in the Massachusetts Bay and Cape Cod Bay). Temporal multifidelity models are one-dimensional, and therefore, D = 1 with x representing time t, and the low-fidelity kernel takes the form of: where L and L t are the hyper-parameters. The kernel for (x) is defined similarly with the hyper-parameters and t . With the addition of , L n and H n , the temporal multifidelity model has seven hyper-parameters, which are learned from the satellite data and MWRA measurements. During these 2 years, 18 in situ measurements were taken at each station. In our models, we randomly chose n H = 11 of those MWRA measurements as our high-fidelity training data and used the 7 remaining for validation purposes (stars and circles, respectively, as shown in Figures 6a-6c). For the low-fidelity data, we linearly interpolated the satellite measurement onto the location of the stations. Overall, we had n L = 125, n L = 64, and n L = 113 low-fidelity points for stations F02, F23, and N18, respectively. Due to cloud coverage, we did not have satellite data for every day for all stations. In particular, station F02 has a low number of satellite measurements. In Figures 6a-6c, the mean values of the multifidelity model along with uncertainty band (±2 standard deviation) are shown for stations F02, F23, and N18. We note the smaller uncertainty near the MWRA measurements, that is, high-fidelity data points. Also, station F02 has a smaller uncertainty band compared with that of stations F23 and N18. The smaller uncertainty estimate of the multifidelity model for station F02 was further justified by accurate hindcast of the MWRA test points, which are not used in training of the multifidelity model. For stations F23 and N18, the MWRA test points also lie within or near the uncertainty band.

Spatial Multifidelity Model of SST
As the second demonstration, we built space multifidelity models in the Massachusetts and Cape Cod Bays for fixed days by combining MWRA measurements with the satellite data. For the spatial 10.1029/2019EA000954 multifidelity model, D = 2 and x = {x 1 , x 2 }, where x 1 and x 2 are longitude and latitude, respectively. We use the square-exponential kernel for low-fidelity model as expressed by and similarly, we employ a square-exponential kernel for (x). These two kernels contain six hyper-parameters: With the addition of , n L and n H , the spatial multifidelity model has nine hyper-parameters.
For our demonstration, we built spatial multifidelity models for the days of 20 March 2015 and 18 May 2016, when MWRA measurements were available. We note that the MWRA measurements of different stations are not taken simultaneously, but they are all taken within a 24-hr time interval. However, we assume that these measurements are taken at exactly the same time. This assumption introduces a small uncertainty that will be estimated via the noise model H in the high-fidelity GP. In Figures 7b and 7c, the mean and the standard deviation of the multifidelity model for 20 March 2015 are shown. To build this multifidelity model, we used the satellite data shown in Figure 7a as the low-fidelity data. We observe that the uncertainty is small near the MWRA stations, and away from the stations, the uncertainty increases (Figure 7c). Comparing the satellite data and the mean of the multifidelity model reveals that the multifidelity model relies on low-fidelity data albeit with a bias correction in regions where no MWRA measurements are available.
To demonstrate the effect of MWRA measurements on the multifidelity model, we built another multifidelity model for 20 March 2015, where the measurement of the MWRA station F29 was added to the training set. The results for this case are shown in the second row of Figure 7. It is clear that the addition of station F29 expands the low-uncertainty region near north of Cape Cod, and it results in a significant update in the multifidelity prediction, specially since the region near the location of station F29 is not densely sampled by MWRA nor is it covered with satellite data for this particular day.
In the last row of Figure 7, the predictions of the multifidelity model for 18 May 2016 are shown. For this day, the satellite measurements provide better coverage in both Massachusetts Bay and Cape Cod Bay. The results show that the satellite and MWRA measurements are correlated and that the multifidelity model corrects the bias uncertainty of the satellite with the inclusion of the MWRA measurements.

Spatiotemporal Multifidelity Model of SST
In this section, we demonstrate the results of the spatiotemporal multifidelity model of SST in the Massachusetts and Cape Cod Bays. This is our most comprehensive model as it exploits both spatial and temporal correlations of the SST data in a multifidelity framework. The spatiotemporal model is three-dimensional (D = 3) and x = {x 1 , x 2 , x 3 }, where x 1 represents longitude, x 2 latitude and x 3 time. There are 11 hyper-parameters for this model. The low-fidelity kernel is expressed by and it contains four hyper- Similarly, the square-exponential kernel for (x) introduces four hyper-parameters }. The multifidelity model also contains three other hyper-parameters: the scaling factor and the low-fidelity and high-fidelity standard deviations expressed with n L , n H , respectively.
We considered 14 MWRA stations in our analysis, from which the SST measured at 11 stations in the span of years 2015-2016 are selected as the training set for high-fidelity points. The remaining three stations are used as a test set for validation of the multifidelity model. The training and test MWRA stations are shown in Figure 8f with square and circle symbols, respectively. Overall, there are n H = 195 high-fidelity points available. The satellite images are used as monthly snapshots, that is, skipping every 30 days resulting in n L = 2, 526 low-fidelity points. Using the entire satellite set of measurements with daily temporal resolution would result in over fitting of the multifidelity model. Besides the issue of over fitting, for cases with large training data sets, the computational cost of training the multifidelity model becomes increasingly prohibitive, as the computational complexity of training the multifidelity model is (n 3 ), where n = n L + n H . This will also be the case for example when high-resolution measurements are needed to build multifidelity model with fine space-time resolution; (see Gardner et al., 2005) for such measurements in the region. Recently, parametric GP has been developed that enables training GP with big data (Raissi et al., 2019) and  it also alleviates the problem of over fitting. The parametric GP enables effective mini-batch training procedures with the computational complexity being reduced to (Mn 3 B ), where M is the number of mini-batches, and n B is the number of training points is each mini-batch. Given that the MWRA and satellite data are provided on monthly intervals, the corresponding multifidelity model can resolve up to monthly time scales.
In Figures 8a-8c, the multifidelity model is validated with the test data set of three MWRA stations (F13, F29, and N04) in the span of 2015-2016. In all three cases, the MWRA measurements are in good agreement with the mean of the multifidelity model. These measurements lie mostly within the uncertainty bands, with a few cases nearby. Among these three stations, F29 shows the largest overall uncertainty as it is not surrounded by MWRA stations that are used in the training of the model (see Figure 8f for the location of F29). On the other hand, N04 has the lowest overall uncertainty and the multifidelity mean yields the best prediction among the three selected stations. In Figures 8 d and 8e, the mean of the multifidelity model and its uncertainty are shown for stations F02 and N07. The measurements of both of these stations are used in the training of the model, and as a result, the uncertainty becomes very small near the time of the measurements. Note that in the absence of the noise model in the high-fidelity GP, represented by H , the multifidelity mean will pass through the MWRA measurements, that is, zero uncertainty. The term H estimates the noise in the MWRA measurements and the unresolved spatiotemporal scales in the multifidelity model.
In Figure 9a, we perform an independent validation of the spatiotemporal multifidelity model by comparing the hindcast prediction of our model with the in situ measurements of buoy A01 for the years 2015 and 2016. The buoy A01 is part of the Northeast Coastal Ocean Observation System NERACOOS, and it is located at 42 • 31 ′ N and 70 • 34 ′ W as shown in Figure 9b with a square symbol. The data for this buoy can be downloaded from: http://www.neracoos.org. The buoy A01 measurements are taken hourly, but the satellite and MWRA training data of the spatiotemporal multifidelity model have a monthly temporal resolution. As a result, the higher frequency oscillations are not resolved in the multifidelity model and they are accounted for with the uncertainty as can be seen in Figure 9a. Overall, a good agreement between the mean of the multifidelity model and the buoy measurements is observed.
In Figure 10 ,we further illustrate the learning of the large-scale seasonal dynamics via the MWRA measurements in the spatiotemporal multifidelity model. In this figure, we show snapshots of the mean and the standard deviation taken every 10 days from 1 January to 21 May 2015 and the approximate days of MWRA measurements (red circles). We observe that uncertainty is consistently lower near the MWRA measurements. However, during the winter, uncertainties are larger compared to late April and May. This is due to the scarcity of satellite data and the lack of MWRA measurements during winter months. This is further confirmed by comparing the measurements of buoy A01 with the multifidelity predictions. In Figure 10, we show the prediction of the multifidelity and its uncertainty (i.e., ±2 ) at the location of the buoy A01 for two days. The buoy measurements are also shown for comparison. On 1 January 2015, the uncertainty is large (i.e., ±4.6 • C) due to lack of data, while on 1 April 2015, the uncertainty is lower (i.e., ±1.4 • C). In both cases, the buoy measurements lie within the uncertainty band.

Summary and Future Work
The presented framework goes beyond traditional predictions with line error bars to continuous space-time uncertainty maps by utilizing and developing modern machine learning techniques that enable fusion of data from all available sources with varying degrees of fidelity. In Figure 11, we compare the difference between single-fidelity Gaussian regression model with the multifidelity model for 20 March 2015 (top row) and 18 May 2015 (bottom row). In the left column, the satellite data on the corresponding days are shown, and the middle column shows the Gaussian regression of the satellite measurements, that is, the single-fidelity model. The right column shows the mean values of the multifidelity model, where we have used MWRA measurements as the source of high-fidelity data. The comparison of the mean values of the single-fidelity and multifidelity models clearly shows that these two temperature maps are correlated. However, the multifidelity model corrects the single-fidelity model up to 3 • C on 20 March 2015 and roughly 1-2 • C on 18 May 2015.  Our long-term goal is to analyze and forecast various quantities of interest and their uncertainty directly related to ocean acidification, that is, salinity, temperature, dissolved inorganic carbon, pH, total alkalinity, using a flexible Bayesian data fusion framework that leaves no data behind -including 3D physics-based models (see e.g., Xue et al., 2014). In particular, we will study the effect of kernel on hindcast and forecast predictions. Kernels can be chosen to encode various structures in the data including smoothness, stationarity, and seasonality. In Figure 12, we show the performance of square exponential and periodic kernels and k PER (t, t ′ ) = 2 exp in forecasting the SST. In the above kernels , l, and p are hyper-parameters. The data used in Figure 12 are obtained from buoy measurements located in the Gulf of Maine at 43.02 • N, 70.54 • W representing SST measurements during more than 9 years -spanning from July 2006 to November 2015. It is evident that for this long time record, the square exponential kernel fails to learn the seasonality beyond the training data, and therefore, it cannot be used for reliable forecast. This is also reflected in the large uncertainty bands. On the other hand, the periodic kernel captures the seasonality and it yields accurate forecast over a long period of time. Choosing an optimal kernel can be automated via Bayesian optimization (Gardner et al., 2015;Martinez-Cantin et al., 2009), or via neural-net-induced GP (Pang et al., 2019; for a broader list of research in this area, see the Automatic Statistician Project, https://www.automaticstatistician.com).
Ultimately, we intend to build 3D volumetric maps of these quantities in the Massachusetts and Cape Cod Bays, including Boston Harbor and the Stellwagen Bank. This will enhance our understanding of coastal ocean acidification and our ability to focus our monitoring efforts in places of high uncertainty and providing management strategies in regions with relative low aragonite saturation means that could negatively affect valuable resources.