# Improvements in the GISTEMP Uncertainty Model

## Abstract

We outline a new and improved uncertainty analysis for the Goddard Institute for Space Studies Surface Temperature product version 4 (GISTEMP v4). Historical spatial variations in surface temperature anomalies are derived from historical weather station data and ocean data from ships, buoys, and other sensors. Uncertainties arise from measurement uncertainty, changes in spatial coverage of the station record, and systematic biases due to technology shifts and land cover changes. Previously published uncertainty estimates for GISTEMP included only the effect of incomplete station coverage. Here, we update this term using currently available spatial distributions of source data and state-of-the-art reanalyses, and we incorporate independently derived estimates for ocean data processing, station homogenization, and other structural biases. The resulting 95% uncertainties are near 0.05 °C in the global annual mean for the last 50 years and increase going back in time, reaching 0.15 °C in 1880. In addition, we quantify the benefits and inherent uncertainty of the GISTEMP interpolation and averaging method. We use the total uncertainties to estimate the probability that each record year in GISTEMP is the true record year (to that date) and conclude with 87% likelihood that 2016 was indeed the hottest year of the instrumental period (so far).

## Key Points

- A total uncertainty analysis for GISTEMP is presented for the first time
- Uncertainty in global mean surface temperature is roughly 0.05 degrees Celsius in recent decades increasing to 0.15 degrees Celsius in the nineteenth century
- Annual mean uncertainties are small relative to the long-term trend

## 1 Introduction

Serious attempts to estimate changes in temperature at the hemispheric and global scale date back at least to Callendar (1938), who used 147 land-based weather stations to track near-global trends from 1880 to 1935 (Hawkins & Jones, 2013). Subsequent efforts used substantially more data (180 stations in Mitchell, 1961; 400 stations in Callendar, 1961; “several hundred” in Hansen et al., 1981; etc.) with greater global reach. While efforts were made to estimate the uncertainty associated with these products, they were more suggestive than comprehensive.

As the data sets have grown in recent years (through digitization and synthesis of previously separate data streams; Freeman et al., 2016; Rennie et al., 2014; Thorne et al., 2018), and efforts have been made to improve data homogenization, bias corrections, and interpolation schemes, the sophistication of the uncertainty models has also grown. Notably, with the introduction of the Hadley Centre sea surface temperature (SST) analysis HadSST3 (Kennedy et al., 2011a, 2011b), Berkeley Earth (Rohde et al., 2013a), and the joint Hadley Centre and University of East Anglia Climatic Research Unit product HadCRUT4 (Morice et al., 2012), Monte Carlo methodologies have been applied to generate observational ensembles that quantify uncertainties more comprehensively than was previously possible.

Goddard Institute for Space Studies Surface Temperature (GISTEMP) is a widely used data product that tracks global climate change over the instrumental era. However, the existing uncertainty analysis currently contains only rough estimates of uncertainty on the land surface air temperature (LSAT) mean and no estimates of the SST or total (land and sea surface combined) global mean. This paper describes a new end-to-end assessment of all the known uncertainties associated with the current GISTEMP analysis (nominally based on the methodology described in Hansen et al., 2010, but with changes to data sources as documented on the GISTEMP website and outlined below), denoted as version 4. We use independently derived uncertainty models for the land station homogenization (Menne et al., 2010, 2018) and ocean temperature products (Huang et al., 2015, 2017), combined with our own assessment of spatial interpolation and coverage uncertainties, as well as parametric uncertainty in the GISTEMP methodology itself.

The analysis was performed in the open source language R (R Core Team, 2016) and the data, code, and intermediate steps needed to generate all figures in this report are available on the GISTEMP website (https://data.giss.nasa.gov/gistemp/uncertainty).

## 2 Overview of Surface Temperature Products

All of the most commonly cited surface temperature analyses split up the calculation of global anomaly fields into separate LSAT and SST anomaly analyses. These independent LSAT and SST analyses are combined into a total (LSAT and SST) global surface temperature index from which spatially averaged global and regional time series can be computed (note this is not strictly equal to the true surface air temperature anomaly; Cowtan et al., 2015). Likewise, the uncertainty analyses for the LSAT and SST are performed separately, then combined into total global uncertainty.

Semioperational surface temperature analyses have been available since the first products by the National Aeronautics and Space Administration (NASA)/Goddard Institute for Space Studies (GISS) and joint work from the Hadley Centre and Climatic Research Unit in the United Kingdom in the late 1970s. There are now multiple updated and peer-reviewed surface temperature products available, notably produced by NASA/GISS (GISTEMP), the National Oceanic and Atmospheric Administration (NOAA) National Centers for Environmental Information (NCEI) with the Merged Land-Ocean Surface Temperature Analysis, HadCRUT, an analysis from the Japanese Meteorological Agency (JMA; Ishihara, 2006), and a reanalysis-based product from the European Centre for Medium-Range Weather Forecasts. These analyses use considerably different methods for the calculation of historical global and regional mean time series but broadly agree on the trends and interannual variations in the global annual mean time series (Figure 1), though they differ at more regional scales as a function of data coverage and interpolation method (Rao et al., 2018). However, interpreting comparisons across surface temperature products requires nuance since the raw data and intermediate product sources are often shared and not completely independent. Of the six major products that are currently being updated in real time, GISTEMP was notable in not having rigorous confidence intervals on the global and regional mean time series.

The treatment of missing land surface data is a major distinction between products. Since monthly temperature anomalies are strongly correlated in space, spatial interpolation methods can be used to infill sections of missing data. However, smoothing due to interpolation obscures spatial variability as grid box estimates are some weighted combination of many stations. HadCRUT4 performs the least interpolation. If a 5° × 5° grid box does not have any station data, this grid box is reported as missing (Morice et al., 2012). The HadCRUT method has the major advantage of clarity in that every grid box is the simple average of the station anomaly values contained in the grid box but suffers in coverage, particularly in the critical Arctic region. At the other extreme, GISTEMP performs the most interpolation by giving stations a 1,200-km radius of influence, regardless of latitude (Hansen et al., 2010). The interpolation allows for infilling during the data-poor early years (pre-1960), but makes it more complex to determine how stations contribute to grid box values. We expand on the GISTEMP method in the following section. The NOAA method performs an intermediate amount of interpolation by aggregating a 5° × 5° grid up to a 15° × 15° grid before modeling the fine-scale variability using an empirical orthogonal function teleconnection analysis as described in Appendix A of Smith and Reynolds (2005). The JMA method is similar to that of HadCRUT4. Comparisons to reanalysis products suggest that the interpolated products have less overall bias compared to the true global mean (Simmons et al., 2016) because the missing data areas are predicted (and seen) to be changing more than the global mean.

Recently, the Berkeley Earth group (Rohde et al., 2013a) and Cowtan and Way (2014) have released more statistically sophisticated products that confirm the observed warming in the NASA, NOAA, and HadCRUT products and provide a more natural uncertainty quantification. Berkeley Earth used an additive Kriging model for the LSAT analysis to estimate interpolated LSAT fields rigorously. Cowtan and Way took this approach a step further and applied kriging-based methods to interpolate both the SST and LSAT fields used in HadCRUT. The results of Cowtan and Way suggest that the inclusion of interpolation is necessary to capture the global effect of the higher rate of warming in the Arctic.

## 3 Operational GISTEMP

The current operational method used in GISTEMP to compute the mean land surface temperature anomaly is an extended version of the process outlined by Hansen and Lebedeff (1987). The analysis contains two major steps: interpolation of individual station data and averaging of interpolated fields. Preliminary to the two core steps, the monthly station data are processed following Hansen et al. (2010). The publicly available code, written in Python, has been updated to modern standards (Barnes & Jones, 2011).

GISTEMP uses the equal-area grid developed in Hansen and Lebedeff (1987). The Earth is divided into 80 equal-area boxes arranged in bands of constant latitude. By constraining each box to cover the same area, the bands have unequal numbers of grid boxes resulting in an irregular grid. There are four bands in each hemisphere representing the polar region, midlatitudes, subtropics, and tropics which respectively contain 4, 8, 12, and 16 equal area boxes. Therefore, the bands account for 10%, 20%, 30%, and 40% of the area of the hemisphere. Each of the 80 boxes are divided into 100 equal-area subboxes resulting in an equal-area grid of 8,000 grid boxes covering the Earth.
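The latitude boundaries of the four bands follow from this equal-area constraint: spherical surface area scales with the sine of latitude, so a band ending at a cumulative hemispheric area fraction *f* (counted from the equator) has its boundary where sin(lat) = *f*. A minimal sketch of that calculation (illustrative only, not the operational code):

```python
import math

# Boundaries implied by the equal-area constraint: a band ending at a
# cumulative hemispheric area fraction f (from the equator) has its
# boundary where sin(lat) = f, since spherical area scales with sin(lat).
boxes_per_band = [16, 12, 8, 4]    # tropics, subtropics, midlatitudes, polar
bounds, cum = [], 0.0
for n in boxes_per_band:
    cum += n / 40                  # each hemisphere holds 40 equal-area boxes
    bounds.append(round(math.degrees(math.asin(min(cum, 1.0))), 1))
print(bounds)                      # [23.6, 44.4, 64.2, 90.0]
```

The resulting boundaries near 23.6°, 44.4°, and 64.2° latitude separate the tropics, subtropics, midlatitudes, and polar bands in each hemisphere.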

### 3.1 Interpolation Step

Each subbox series is computed as a weighted average of the records of all stations within a given radius of the subbox center. The weight *W* for a station *d* km away from the subbox center within a given radius *r* is determined using a linear radial basis function of the form

*W*(*d*) = 1 − *d*/*r*

The value of the radius, *r* = 1,200 km, was estimated based on an investigation of the correlation of the annual mean series of pairs of stations as a function of their spatial separation (Hansen & Lebedeff, 1987); this simple device turned out to be quite similar to the form of the estimated covariance function in the modified Kriging method used by the Berkeley Earth analysis (Rohde et al., 2013b). If there are no stations within 1,200 km of a subbox center, it is given a missing value.
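A sketch of this interpolation step in Python (the function and station tuples are illustrative names, not the operational code):

```python
import math

R_EARTH = 6371.0    # mean Earth radius (km)
RADIUS = 1200.0     # radius of influence r (km)

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km via the haversine formula."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2.0 * R_EARTH * math.asin(math.sqrt(a))

def subbox_anomaly(center, stations):
    """Weighted anomaly at a subbox center from (lat, lon, anomaly) tuples.

    Stations are weighted by W(d) = 1 - d/r; if no station lies within
    the radius, the subbox is reported as missing (None)."""
    num = den = 0.0
    for lat, lon, anom in stations:
        d = great_circle_km(center[0], center[1], lat, lon)
        if d < RADIUS:
            w = 1.0 - d / RADIUS    # linear radial basis weight
            num += w * anom
            den += w
    return num / den if den > 0.0 else None
```

A station at the subbox center receives weight 1, a station 600 km away receives weight 0.5, and stations beyond 1,200 km contribute nothing.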

### 3.2 Averaging Step

The averaging step calculates the regional and global time series from the interpolated subbox records. In this context, regional refers to hemispheric and the eight latitudinal bands in the equal area grid. First, an average series is computed for each of the 80 equal area boxes by the method described in the interpolation step section, except that equal weight is given to each equal area subbox series. The LSAT and SST data are combined when each of the 80 box series are created. In each subbox, either a pure SST series or a pure LSAT series is selected. SST data are used only for ocean subboxes that contain no sea ice and whose center is more than 100 km off the nearest land station. Everywhere else we use the LSAT data.

The averages for the eight latitudinal zonal bands are then computed from the box series weighted by the number of subboxes with data. The three extratropical bands in each hemisphere are combined in the same way into a single series. These two series and the two tropical series are converted to anomaly series with respect to the 1951–1980 period. Global and hemispheric anomalies are computed as weighted averages of these four band means, weighted by the full area of these bands.
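As an illustration, the two-stage weighting just described can be sketched as follows (hypothetical helper names; a simplification of the operational box-to-band-to-global aggregation):

```python
def band_mean(box_series, box_counts):
    """Average of equal-area box values for one band, weighted by the
    number of subboxes with data in each box; missing boxes are skipped."""
    pairs = [(s, n) for s, n in zip(box_series, box_counts) if s is not None]
    total = sum(n for _, n in pairs)
    return sum(s * n for s, n in pairs) / total if total > 0 else None

def global_mean(band_means, band_areas):
    """Global anomaly as the area-weighted mean over bands with data."""
    pairs = [(m, a) for m, a in zip(band_means, band_areas) if m is not None]
    total = sum(a for _, a in pairs)
    return sum(m * a for m, a in pairs) / total if total > 0 else None
```

For example, `global_mean([1.0, 3.0], [0.5, 0.5])` returns 2.0, and a box with no data simply drops out of its band average.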

### 3.3 Changes to Operational GISTEMP 2010–2018

The only difference in methodology since Hansen et al. (2010) not caused by changes in the available input data was the combination of the 40 subboxes reaching the North and South Poles into single polar boxes (starting September 2016). This produced more natural-looking maps near the poles and had an insignificant effect on the results.

All other changes relate solely to the input data. In 2010, GISTEMP was using GHCN-Monthly version 2 (GHCNv2), the U.S. Historical Climatology Network version 2.0 (USHCN2), and the Scientific Committee on Antarctic Research (SCAR) temperature data over land, with Hadley Centre Sea Ice and SST data set and Optimum Interpolation SST for the ocean. With the upgrade to GHCNv3 in December 2011 (and then to v3.2 in September 2012, and now to v4), the need for USHCN2 was obviated. In GHCNv3 as in GHCNv4, the various data series from different sources for a location, which were available in GHCNv2, are merged into a single series, and the resulting inhomogeneities are resolved in the adjustment procedure. Hence, GISTEMP is using the adjusted GHCNv3 and GHCNv4 data. Whereas combining different sources at a location and manual corrections are no longer needed, the GISS urban adjustment scheme is still being applied. For the ocean data, the ocean temperature product was replaced with the more homogeneous Extended Reconstructed SST (ERSST) v3b in January 2013, which was updated to ERSSTv4 in July 2015, and to ERSSTv5 in August 2017. The impacts over time of these changes are recorded and maintained on the GISTEMP History page https://data.giss.nasa.gov/gistemp/history.

Analyses subsequent to Hansen et al. (2010) that use GHCNv3 are now being denoted GISTEMP v3. The integration of GHCNv4 into the GISTEMP code in January 2019 is denoted as GISTEMP v4.0; this version does not use the SCAR data except as far as they are part of GHCNv4. Going forward, a more rigorous version numbering scheme will be adopted to better track methodological and input data variations. GISTEMP v3 will nonetheless be maintained for the time being for legacy purposes. The uncertainty analysis presented here is strictly valid for GISTEMP v4.0, but the differences when it is applied to v3 are insignificant and primarily arise from differences in GHCN homogenization.

### 3.4 Prior Uncertainty Estimates

GISTEMP has previously presented uncertainties due to incomplete spatial coverage of the station record (Hansen & Lebedeff, 1987). Most recently, Hansen et al. (2010) reported estimates of this uncertainty for three large time periods: 1880–1900, 1900–1950, and 1960–2008. The analysis subsampled a long run of the GISS-ER climate model (Hansen et al., 2007) according to the coverage of the station network on the Earth during these three time periods. This model had a 4° × 5° latitude by longitude grid. Global annual land-only means of the subsampled model were compared with global annual land-only means using all of the grid boxes.

Since the global mean calculation in GISTEMP aggregates from small subboxes to the 80 equal-area boxes, the coarse model grid approach has considerable value in quantifying the large-scale sampling uncertainty, assuming that the model captures sufficient statistical structure of the underlying fine-scale global temperature anomaly field. The uncertainty calculation also roughly captures large-scale spatial and temporal sparsity. An equal-area box that has no data within 1,200 km is “missing” in the GISTEMP global and regional mean calculation and is on the approximate scale of the model grid. Furthermore, the large grid box size of that model serves as a rough approximation of the interpolation step of the GISTEMP procedure.

We address a number of deficiencies in the legacy GISTEMP LSAT sampling uncertainty analysis in this study. The first goal is increasing the temporal resolution of the uncertainty from roughly 50-year periods to decadal estimates of LSAT sampling uncertainty. Further refinements to the annual or even monthly timescale do not make a substantive difference. Second, we aim to better capture the uncertainty in the interpolation step of GISTEMP. The coarse resolution of the previously used model grid does not describe the fine-scale behavior of the true temperature anomaly field and does not allow us to replicate the interpolation step. As we detail in the following section, we now use a product with a much finer horizontal grid to replicate the entire GISTEMP global and regional mean calculation. Thus, we are more confident that our calculated uncertainties reflect the actual analysis method used. Finally, we compare the uncertainties of the GISTEMP band averaging scheme with those of a simple latitude-weighted mean of the land surface temperature.

The previously reported GISTEMP uncertainties do not include parametric uncertainties due to homogenization of the station record or uncertainties associated with the SST reconstruction. By adding in the homogenization uncertainty from the GHCN data set and propagating the uncertainty from the ERSSTv5 data set through the GISTEMP procedure, we obtain a holistic estimate of the full uncertainty in the GISTEMP product.

## 4 Sources of Uncertainty

### 4.1 Statistical Formulation of Uncertainty

Letting *μ*(*t*) be the true (latent) global anomaly for a year *t*, we view the calculated (observed) annual mean temperature anomaly *A*(*t*) as

*A*(*t*) = *μ*(*t*) + *ϵ*(*t*)

where *ϵ*(*t*) is a random variable that represents the total uncertainty in our estimate of the annual mean temperature anomaly. Assuming that our estimation procedure is unbiased (an assumption we will revisit), the expected value E[*ϵ*(*t*)] = 0 for all years *t*. The uncertainty in our calculation of the global mean anomaly is then defined as Var(*ϵ*(*t*)).

The total uncertainty is split into uncertainty in the global mean anomaly due to uncertainties in the land surface calculation *ϵ*_{L}(*t*) and uncertainty in the global mean anomaly due to uncertainties in the sea surface calculation *ϵ*_{S}(*t*). We decompose our total uncertainty as

*ϵ*(*t*) = *ϵ*_{L}(*t*) + *ϵ*_{S}(*t*)

We proceed on the assumption that the land and ocean uncertainties are independent. However, there is potentially correlation between the uncertainty due to the land calculation and the uncertainty due to the ocean calculation. In addition to correlation between the land and ocean uncertainties, we also expect some amount of correlation in time, particularly at the monthly time scale. Not accounting for positive correlation of uncertainties in time will lead to underestimation of the uncertainty. To reduce the impact of this autocorrelation, we look at the annual mean temperature anomalies which exhibit much lower autocorrelation.
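Under this independence assumption, the land and ocean standard errors combine in quadrature. A minimal sketch (the function name and input magnitudes are illustrative, not the paper's estimates):

```python
import math

def total_uncertainty_95(sigma_land, sigma_ocean):
    """95% uncertainty (1.96 sigma, assuming Gaussian errors) from
    independent land and ocean standard errors combined in quadrature."""
    return 1.96 * math.sqrt(sigma_land ** 2 + sigma_ocean ** 2)

# Illustrative magnitudes only:
print(round(total_uncertainty_95(0.02, 0.015), 3))   # → 0.049
```

Because the errors add in quadrature, the larger of the two components dominates the total: halving the smaller term changes the combined value only slightly.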

### 4.2 Land Surface Temperature Uncertainty

Quantifying the uncertainties that arise from using the land station record to calculate regional and global land-only mean temperatures has been an active field for many years. In particular, NOAA (Vose et al., 2012) and HadCRUT (Morice et al., 2012) groups have developed sophisticated uncertainty models for this portion of the analysis. It is generally assumed that there are three major independent sources of uncertainty in the land record that add uncertainty to global temperature calculations: station uncertainty, bias uncertainty, and sampling uncertainty. We will outline these three briefly (though see Brohan et al., 2006 for a detailed discussion). As with the operational GISTEMP, we define the land surface as any grid box that is classified as land or sea ice.

#### 4.2.1 Station Uncertainty

Station uncertainty encompasses the systematic and random uncertainties that occur in the record of a single station and include measurement uncertainties, transcription errors, and uncertainties introduced by station record adjustments and missed adjustments in postprocessing. The random uncertainties can be significant for a single station but comprise a very small amount of the global LSAT uncertainty to the extent that they are independent and randomly distributed. Their impact is reduced when looking at the average of thousands of stations.
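This cancellation can be illustrated with a quick simulation: independent Gaussian station errors with an assumed standard deviation of 0.5 °C shrink roughly as 1/√n in an n-station mean (the values are illustrative and not drawn from the actual station network):

```python
import random
import statistics

random.seed(42)
n_stations, n_trials = 1000, 400
sigma_station = 0.5      # assumed per-station random error (degC)

# Spread of the n-station average across many independent trials.
means = [statistics.fmean(random.gauss(0.0, sigma_station)
                          for _ in range(n_stations))
         for _ in range(n_trials)]
spread = statistics.stdev(means)
print(spread)            # near 0.5 / sqrt(1000), i.e. about 0.016
```

With thousands of stations, a half-degree random error per station contributes only a few hundredths of a degree to the mean, consistent with the statement above.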

The major source of station uncertainty is due to systematic, artificial changes in the mean of station time series due to changes in observational methodologies. These station records need to be homogenized or corrected to better reflect the evolution of temperature. The homogenization process is a difficult, but necessary statistical problem that corrects for important issues albeit with significant uncertainty for both global and local temperature estimates.

#### 4.2.2 Bias Uncertainty

Bias uncertainty refers to the biases in a single station record due to nonclimatic sources. Thermometer exposure change bias (Parker, 1994) refers to biases introduced to the station record by the evolution of temperature measurement techniques, such as the switch to Stevenson screens in the nineteenth century or the change to Max-Min Temperature Sensor automated recorders in recent decades in the United States (Menne et al., 2009). Urban biases are not due to systematic biases in the instrumentation, but rather to the local warming effect of urban centers through land surface changes, reductions in evapotranspiration, and local heat sources. These urban biases are corrected for in our global temperature product since the goal is to understand the changes in the global climate system, not the localized effect of urban heat islands. An urban bias correction was added to GISTEMP in 1998 (Hansen et al., 1999); that analysis confirmed that its impact on global temperature anomalies is small. As shown in Hansen et al. (2010), the effect of the urban adjustment on global temperature change is on the order of 0.01 °C.

#### 4.2.3 Sampling Uncertainty

Sampling uncertainty is an umbrella term for uncertainties introduced into global and regional annual means by incomplete spatial and temporal coverage. Whereas the station uncertainties are observed to mostly cancel out in modern-era global annual means, as many of the uncertainties are independent from station to station, the sampling uncertainties remain significant. Understanding the sampling uncertainty of GISTEMP is crucial because, unlike HadCRUT, GISTEMP extrapolates the anomaly field into regions without station data. Quantifying the sampling uncertainty will provide a measure of confidence in the extrapolation. Since the reduction in bias in the global mean due to interpolation comes with an increase in uncertainty variance, we need to ensure that the interpolation does not drastically inflate the sampling uncertainty.

Quantifying the sampling uncertainty is critical to providing uncertainties for the mean temperatures for two reasons. First, the HadCRUT analysis has shown that the sampling uncertainty is a significant component of the uncertainty in the global annual means in the modern instrumental era (Morice et al., 2012). Second, updating the sampling uncertainty model provides transparent continuity in the GISTEMP analysis for numerous researchers that rely on the data product for their own analyses. As we will detail in the following section, GISTEMP has historically made only rough estimates of the sampling uncertainty. Our update here provides a transition from the original GISTEMP uncertainty model toward a more modern statistical approach.

### 4.3 SST Uncertainty

The current production versions of GISTEMP use the ERSSTv5 product provided by NOAA/NCEI (Huang et al., 2017) for ocean temperatures. ERSSTv5 uses the same underlying method (Huang et al., 2015) and uncertainty quantification method (Huang et al., 2016; Liu et al., 2015) as the previous generation ERSSTv4. The major upgrade in v5 is a more sophisticated parameter tuning, resulting in more realistic spatiotemporal patterns in the reconstructed SST fields. In addition, v5 incorporates new data sources from the International Comprehensive Ocean-Atmosphere Data Set 3.0 (Freeman et al., 2016) and the Argo float network of near-surface readings.

The uncertainty calculation in ERSSTv4/v5 breaks down the ocean uncertainty into two independent components: parametric uncertainty and reconstruction uncertainty (Huang et al., 2016; Liu et al., 2015). Parametric uncertainty quantifies the internal statistical variability of the ERSST procedure and is defined by the standard deviation of a perturbed parameter ensemble. The ensemble has been constructed such that the parametric uncertainty contains both the bias and sampling uncertainty (Huang et al., 2016). Reconstruction uncertainty represents the information lost in using a finite number of empirical orthogonal teleconnection functions to model the high-frequency component. Reconstruction uncertainty can be large at small spatial scales but averages out to nearly zero at global scales as seen in Figure 2c of Huang et al. (2016). Since we are concerned with global and hemispheric mean uncertainty in this study, it is reasonable to ignore the reconstruction uncertainty and focus only on the parametric uncertainty.

## 5 Update to GISTEMP's Uncertainty Analysis: Methods

### 5.1 Updated Land Surface Temperature Uncertainty Methodology

#### 5.1.1 Data Sources

*GHCN:* The primary data source for LSAT data in GISTEMP v4.0 is the monthly GHCN product from NOAA/NCEI. As mentioned in our discussion of the updates to operational GISTEMP in section 3.3, we have replaced the combined GHCNv3 and SCAR data with GHCNv4 as of January 2019. Thus, we perform our LSAT uncertainty analysis using GHCNv4 but comment briefly on how the results apply to GISTEMP v3. GHCNv4 contains significantly more stations than GHCNv3/SCAR, though many of the additional time series are short. In general, the added stations in GHCNv4 do not significantly alter spatial coverage after interpolation and so will not significantly affect the spatial uncertainty, though they do slightly reduce the homogenization uncertainty. We compared the number of grid boxes in the Modern-Era Retrospective Analysis for Research and Applications (MERRA) model that contained a station with decadal coverage within the 1,200-km interpolation radius of influence (Figure 2) and find nearly no difference between versions. The increased quantity of stations will likely be most useful for more localized analyses.

*Reanalyses:* We use three distinct reanalysis products as globally complete “ground truth” temperature fields to quantify the contribution of the incomplete spatial and temporal coverage of the station record to the uncertainty in the global temperature anomaly. They are the fifth generation European Centre for Medium-Range Weather Forecasts (ECMWF) atmospheric reanalysis (ERA5), the JMA JRA-55 analysis (hereafter JRA55), and MERRA-2 (hereafter MERRA). Since the legacy sampling uncertainty calculation inherently aggregates spatially due to large grid box size, it cannot utilize GISTEMP's interpolation method for the uncertainty analysis. In this study, we take a similar methodological approach using a high-resolution reanalysis product in place of the climate model output. The rough idea is the same: total coverage global means are compared with realistic (reduced) coverage global means, and the uncertainty is described by summary statistics. The finer spatial resolution of the reanalyses compared to the previously used climate model allows us to treat single grid box temperature anomaly values as station anomalies. The combination of improved spatial resolution and an analysis more closely mirroring the production GISTEMP procedure gives us more robust calculations of the sampling uncertainty in the global mean.

The primary reanalysis used in our study is monthly ERA5 from 1979–2018 (Copernicus Climate Change Service (C3S), 2017). We average the 2-m temperature to the 0.5° × 0.625° MERRA grid to facilitate comparison and speed up computation. We find no significant changes to our results when verified on the native 31-km grid. We choose ERA5 as the primary reanalysis since it best replicates the observed global mean over its record. Furthermore, we find that ERA5 and JRA55 produce generally consistent results while results found using MERRA often deviate.

MERRA provides monthly temperature means for the entire Earth from 1980–2018 at a 0.5° × 0.625° resolution (Gelaro et al., 2017). MERRA is also included because of the observational data sources used in its assimilation. Since our goal is to determine the uncertainty that arises from the incomplete coverage of the GHCN station record, it is ideal to use a reanalysis that does not incorporate any GHCN information. Over land, MERRA only assimilates surface temperature data from the surface reading of the radiosonde network, ensuring that we fit our statistical model for GHCN sampling uncertainty with an independent data source (McCarty et al., 2016). We also verify all results with the JRA55 reanalysis over 1979–2013 (Kobayashi et al., 2015) to provide clarity when the results from MERRA and ERA5 disagree.

#### 5.1.2 LSAT Sampling Uncertainty Method

A grid box is determined to be land for the purpose of our study if its land area proportion is greater than 0% on the MERRA grid, approximately replicating the 100-km influence of land stations onto ocean grid cells in operational GISTEMP. As in operational GISTEMP, we determine sea ice extent for each month by the maximal extent of sea ice in the MERRA reanalysis. Grid boxes that are not classified as land are classified as ocean with uncertainty quantified by the SST uncertainty analysis.

Monthly temperature anomalies are computed for the entire reanalysis grid for each of the 12 months by removing the single month mean for each grid box time series. The full monthly temperature anomaly fields are used to calculate the baseline global and zonal annual means. We use a modified version of the GISTEMP averaging step with the same zonal bands and 80 equal area grid boxes and replace the subboxes with the reanalysis grid. The baseline global mean represents the true global anomaly *μ*(*t*), which will be compared with the mean anomalies calculated with reduced coverage *A*(*t*).

The spatial subsampling of the anomaly field is determined at a decadal temporal resolution. A station has temporal coverage in a decade if it has coverage for at least 5 of the 10 years. A station has coverage for a year if it has coverage for at least three seasons, where coverage for a season requires at least 2 months of data. A grid box is said to have coverage in a decade if it contains at least one station with coverage as defined above. Using these definitions, we create 14 decadal coverage masks on the grid, one for each decade from the 1880s to the 2010s. That is, we have a constant mask that describes the coverage of the observing network for each decade.
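These coverage rules can be encoded directly. A sketch, assuming calendar-year meteorological seasons (DJF, MAM, JJA, SON), a grouping the text does not specify, and a hypothetical `monthly_present` mapping from (year, month) to data availability:

```python
# Meteorological seasons within a calendar year (an assumption; the text
# does not specify the season grouping or how winter wraps the year).
SEASONS = [(12, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)]

def year_has_coverage(monthly_present, year):
    """At least three seasons with at least 2 months of data each."""
    good = 0
    for season in SEASONS:
        if sum(bool(monthly_present.get((year, m))) for m in season) >= 2:
            good += 1
    return good >= 3

def decade_has_coverage(monthly_present, decade_start):
    """At least 5 of the decade's 10 years have coverage."""
    covered = sum(year_has_coverage(monthly_present, y)
                  for y in range(decade_start, decade_start + 10))
    return covered >= 5
```

A station reporting every month for 5 years of a decade qualifies; one reporting for only 4 years does not.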

Reduced coverage global annual means, *A*_{k}(*t*), are calculated for each of the 14 decadal time periods using a modified GISTEMP procedure. In the notation *A*_{k}(*t*), *k* represents the decade used and *t* represents the year in the reanalysis record. The interpolation step is performed on the reanalysis grid using a radius of 1,200 km. Then the averaging step is performed as described in the baseline global mean calculation with the subboxes taken to be the area-weighted grid boxes. Thus, the baseline global mean is an annual time series indexed by *t* spanning 1980–2017. There are *k*=1,…,14 reduced coverage global means for the 14 decades of the study, each annual time series spanning from 1980–2017.

As the sampling uncertainty in ocean regions is quantified as part of the SST uncertainty analysis for ERSSTv5, we only include land area in the LSAT sampling uncertainty. The global and reduced coverage global land means are taken over land and sea ice regions following the GISTEMP procedure. Sea ice regions are defined using MERRA as the maximum extent of ice for each month over the reanalysis record.

We define the difference series *D*_{k}(*t*) for decade *k* as

*D*_{k}(*t*) = *A*_{k}(*t*) − *μ*(*t*)

Then the uncertainty is Var(*D*_{k}(*t*)). Note that this method assumes that our method of calculating the global mean does not have any systematic bias.
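A minimal numerical sketch of this step (toy values; `sampling_variance` is a hypothetical helper): take the variance of the difference between a reduced-coverage mean and the baseline mean.

```python
import numpy as np

# Sampling uncertainty as the variance of the difference between the
# reduced-coverage mean A_k(t) and the baseline mean mu(t).
def sampling_variance(a_k, mu):
    d = np.asarray(a_k) - np.asarray(mu)  # difference series D_k(t)
    return d.var(ddof=1)

mu = np.array([0.10, 0.20, 0.30, 0.40])   # baseline global anomalies (toy)
a_k = np.array([0.15, 0.18, 0.33, 0.42])  # reduced-coverage anomalies (toy)
var_k = sampling_variance(a_k, mu)
ci95 = 1.96 * np.sqrt(var_k)              # corresponding 95% half-width
```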

#### 5.1.3 LSAT Sampling Extensions

Our sampling uncertainty analysis allows us to investigate other properties of the GISTEMP LSAT method. We describe three experiments addressed in our study. First, we challenge the assumption that the land surface mean temperature estimate is an unbiased estimate. Then, we calculate the minimum achievable sampling uncertainty due to the GISTEMP interpolation assuming full global station coverage. Finally, we provide one measure of the value of the GISTEMP averaging method.

*Sampling bias:* Recent studies have shown the likely presence of bias in surface temperature products compared to the true global mean (Cowtan & Way, 2014; Jones, 2016; Karl et al., 2015; Simmons et al., 2010, 2016). In addition, recent evidence from remotely sensed temperature analyses suggests that production GISTEMP may be underestimating Arctic warming (Susskind et al., 2019). To quantify the potential sampling biases due to limited station coverage, we introduce a potential systematic additive bias *α*_{k} and multiplicative bias *β*_{k}. Then, determining the variance of *ϵ*_{k} can be formulated as the univariate regression

*A*_{k}(*t*) = *α*_{k} + *β*_{k}*μ*(*t*) + *ϵ*_{k}(*t*).

Since we are working with anomalies that are standardized over the entire time period of ERA5 (1979–2018), the additive bias *α*_{k}=0 for all decades as all of our grid box time series are mean zero. However, we fit the full linear regression as a sanity check as it will have practically no effect on our estimation of *β*_{k} or the uncertainty. Since the ERA5 reanalysis currently spans 1979–2018, only the estimates for the 1980s through 2010s are representative of potential bias in operational GISTEMP. The estimates for decades pre-1980 do not reflect the actual bias in GISTEMP during their periods as the underlying climate variability is not properly accounted for. However, the estimates of bias due to limited coverage in early decades are useful for understanding the importance of station coverage for capturing the current pattern of global temperature change.
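The decadal bias fit can be sketched as an ordinary least-squares regression of the reduced-coverage mean on the baseline mean (synthetic data below; the actual analysis uses the reanalysis-derived series):

```python
import numpy as np

# Fit A_k(t) = alpha_k + beta_k * mu(t) + eps_k(t) by least squares.
rng = np.random.default_rng(1)
mu = np.linspace(-0.5, 1.0, 40)                    # baseline anomalies (toy)
a_k = 0.95 * mu + rng.normal(scale=0.05, size=40)  # reduced-coverage means (toy)

beta_k, alpha_k = np.polyfit(mu, a_k, 1)           # slope, intercept
resid = a_k - (alpha_k + beta_k * mu)
eps_var = resid.var(ddof=2)                        # Var(eps_k); 2 fitted params
```

A fitted `beta_k` below 1 would indicate that the reduced network damps the true global signal; `alpha_k` is expected to be near zero for standardized anomalies, as noted above.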

*Limiting Uncertainty:* A lower bound of the sampling uncertainty is calculated by running the sampling uncertainty analysis in section 5.1.2 with the assumption that we have station coverage for every land grid box. We expect the limiting uncertainty to be greater than 0 as the smoothing arising from interpolation increases the uncertainty in the global mean. Calculation of the limiting uncertainty is important to determine the relative potential of increased data availability and methodological improvements for lowering the uncertainty of the global mean estimate. In addition to quantifying the lower uncertainty bound, we run the sampling bias analysis with the simulated full coverage to determine if the GISTEMP method has any systematic bias in an idealized case over the 1979–2018 period.

*Comparison of averaging methods:* In addition to using the GISTEMP band-average method, we run the sampling uncertainty analysis in section 5.1.2 using a simple latitude-weighted mean. Comparison of the resulting LSAT sampling uncertainties shows the difference between the two averaging methods in accounting for missing data.

#### 5.1.4 GHCN Homogenization Uncertainties

Station uncertainty due to homogenization of station series is quantified in the GHCNv4 analysis and incorporated in the GISTEMP uncertainty analysis with no modification (Menne et al., 2018). The GHCNv4 method divides the total homogenization uncertainty for land stations into two independent components: the parametric uncertainty associated with the Pairwise Homogenization Algorithm (PHA; Menne & Williams, 2009) used to homogenize the GHCNv4 monthly data and incomplete homogenization caused by artificial shifts in the data that remain undetected by the PHA.

The PHA detects artificial time series mean shifts due to changes in observing practice by comparing a station series with neighboring stations (Menne & Williams, 2009). Various parameters, such as the minimum number of neighboring stations, are set in implementing the PHA and affect the sensitivity and accuracy of the method. Parametric uncertainty is quantified by running the PHA as an ensemble whose members have randomly varying parameter settings from a set of configurations that produced the best results when run on realistic benchmark data sets (Williams et al., 2012). For GHCNv4 monthly, 100 different versions of the PHA were used to homogenize the GHCNv4 data, yielding 100 different homogenized versions of each GHCN station record (Menne et al., 2018). The parameter uncertainty is determined by the sample standard deviation of the 100 feasible records.

While the PHA detects large (>0.2 °C) breaks in time series, it (and other break-point detection methods) is unable to detect small shifts. This uncertainty associated with incomplete homogenization is estimated by adding small adjustments to the homogenization ensemble members at random dates and with random magnitudes. The frequency and magnitude of the added adjustments were determined by estimating the distribution of the missed (mostly small) breaks from the distribution of actual breaks detected by the PHA. Detected breaks in GHCNv4 have a bimodal distribution with peaks around ±0.5 °C. In between these peaks is the so-called “missing middle” of the distribution, which Menne et al. (2018) estimated as having a mean of about −0.01 °C and a standard deviation of 0.2 °C, with an average frequency of occurrence of about 1 in 50 years. The number of missed adjustments for each station record in the ensemble was determined by sampling from a Poisson distribution with an average frequency of 1 in 50 years, and their magnitude was selected by a random draw from a normal distribution with these parameters.
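A toy version of the missed-break perturbation might look like the following; the helper name and exact parameterization are illustrative, not the GHCNv4 code:

```python
import numpy as np

# Perturb a station series with randomly placed step changes: break counts
# from a Poisson process (~1 break per 50 years on average) and magnitudes
# from a normal distribution, echoing the "missing middle" described above.
rng = np.random.default_rng(2)

def add_missed_breaks(series, rate=1 / 50, mean=-0.01, sd=0.2):
    out = series.copy()
    n_breaks = rng.poisson(rate * len(series))
    for _ in range(n_breaks):
        start = rng.integers(len(series))
        out[start:] += rng.normal(mean, sd)  # step change persists forward
    return out

series = np.zeros(100)                       # a century-long flat record
perturbed = add_missed_breaks(series)
```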

#### 5.1.5 Total Land Surface Temperature Uncertainty Methodology

As introduced in our discussion of sources of land uncertainty in section 4.2, land surface temperature uncertainty arises due to station, bias, and sampling uncertainties. Since the three sources are independent and we can ignore the bias uncertainty for means of large spatial scale, the total LSAT uncertainty is defined as the sum of the station homogenization and sampling uncertainties. As these uncertainties are expressed as variances, it is critical that the variance for the homogenization and sampling uncertainties are added rather than the standard deviations or confidence intervals.
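The warning about combining components can be made concrete: for independent components, variances add, so standard deviations combine in quadrature (magnitudes below are invented for illustration):

```python
import math

# Independent uncertainty components: add variances, never standard deviations.
def combine_sd(sd_homog, sd_sampling):
    return math.sqrt(sd_homog**2 + sd_sampling**2)

total = combine_sd(0.03, 0.04)   # quadrature sum, smaller than 0.03 + 0.07
```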

### 5.2 SST Uncertainty Methodology

We use the uncertainty analysis from ERSSTv4 to quantify the uncertainty in the ocean temperature in the GISTEMP analysis as ERSSTv5 did not make any changes to the underlying reconstruction or uncertainty methods (Liu et al., 2015). ERSSTv4 quantified uncertainty through an ensemble of feasible SST fields rather than a single uncertainty field. The largest ensemble simulation contains 1,000 members and was constructed to quantify the parametric uncertainty in their prediction (Huang et al., 2016). Our analysis utilizes this 1,000-member large ensemble to understand how the uncertainty in the ERSST product impacts the GISTEMP uncertainty.

The parametric SST global and hemispheric uncertainty calculation closely follows the analysis performed by the ERSST team (Huang et al., 2016). We perform the GISTEMP averaging step with no land data for each of the 1,000 ensemble members resulting in 1,000 possible global and hemispheric time series. That is, we calculate the global mean with an ocean-only mask for each of the ERSST ensemble members. The 95% confidence interval for the parametric uncertainty of the SST model are calculated for each time point using the empirical 95% confidence interval of possible global mean SST.
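The empirical-interval step can be sketched with synthetic data standing in for the 1,000 global mean time series:

```python
import numpy as np

# Empirical 95% interval across ensemble members at each time point.
rng = np.random.default_rng(3)
ensemble = rng.normal(loc=0.2, scale=0.05, size=(1000, 30))  # members x years

lo, hi = np.percentile(ensemble, [2.5, 97.5], axis=0)  # per-year bounds
half_width = (hi - lo) / 2                             # ~1.96 * ensemble sd
```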

Our assumption in this calculation is that the ERSST large ensemble is symmetric about the median for global and hemispheric means and that ERSSTv5 is the median value of the ensemble. Both of these assumptions are not perfect, but reasonable for these large-scale means. We find that the mean and median of the global SST mean ensemble are nearly identical. Furthermore, the strong agreement between the operational and ensemble global mean (and thus global median from our result) in Figure 12 of Huang et al. (2016) supports the assumption that the global uncertainty is symmetric.

### 5.3 Total Global Uncertainty Methodology

For each year *t* and given an estimate of the global mean anomaly, we define the uncertainty of the global annual mean temperature as the variance of the difference between the estimated and the true global mean anomaly.

The land-only uncertainty is composed of the sampling uncertainty calculated using the method described in section 5.1 with missing values for all of the ocean grid cells and the homogenization uncertainty according to the GHCNv4 analysis. Likewise, ocean-only uncertainty is calculated using the method described in section 5.2 with missing values for all of the land grid cells. The resulting uncertainties then describe the uncertainty over a subset of the area of the Earth.

Weighting by *a*_{L} and *a*_{S} (the area of the land and ocean on the Earth, respectively) and assuming that the land and ocean uncertainty components are independent, the total global uncertainty variance is

Var(*t*) = (*a*_{L}^{2}Var_{L}(*t*) + *a*_{S}^{2}Var_{S}(*t*)) / (*a*_{L} + *a*_{S})^{2}.
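The area-weighted combination of land and ocean variances can be illustrated as follows (surface fractions are approximate and the component variances are invented):

```python
# Combine independent land and ocean variances into a total global variance,
# weighting by the areas of the two domains.
a_land, a_ocean = 0.29, 0.71              # approximate Earth surface fractions
var_land, var_ocean = 0.03**2, 0.02**2    # toy component variances

w = a_land + a_ocean
var_total = (a_land**2 * var_land + a_ocean**2 * var_ocean) / w**2
sd_total = var_total**0.5
```

Note that the total standard deviation is smaller than either component's would suggest on its own, because each domain contributes only in proportion to its squared area weight.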

Hemispheric and other regional combined land and ocean uncertainties are calculated similarly.

Uncertainty values from products that are not operational are assumed constant for time periods after the end of their record. For the SST uncertainty, the ERSST ensemble was only issued through 2014. Thus, we use the 2014 value for years 2015–2018 and will update the analysis as more data become available. Likewise, the GHCNv4 homogenization was conducted through 2016 resulting in the 2017 and 2018 homogenization uncertainties being set to the 2016 value.

## 6 Results

### 6.1 LSAT Uncertainty Results

The sampling and total uncertainty in the global annual land surface mean temperature as calculated from each of the three reanalyses are shown in Figure 3a. As expected, the increasing number and spatial coverage of stations over time result in decreasing sampling uncertainty. The three reanalyses are in general agreement, with any differences in sampling uncertainty shrinking once folded into the total LSAT uncertainty. In the early decades of the study period, sampling uncertainty and homogenization uncertainty are of similar magnitude.

Figure 3b shows the LSAT as found with the ERA5 sampling uncertainty analysis. We will use the ERA5 analysis for the LSAT estimates in the remainder of the study. The homogenization component includes both the parametric uncertainty as well as uncertainties due to missed breaks. Approaching the present, the global sampling uncertainty decreases as the majority of the land has some station coverage, but the global homogenization uncertainty remains high. In particular, the major drop in sampling uncertainty in 1950–1970 occurs due to the inclusion of Antarctica. The relative lack of decrease in the global mean homogenization uncertainty results from correction uncertainties in station records propagating forward in time (Menne et al., 2018). The minor contribution of sampling uncertainty to the total modern LSAT uncertainty illustrates that increasing the coverage of temperature monitoring alone will not resolve the remaining uncertainty in the land surface temperature record.

The ERA5 analysis shows that the uncertainties in Hansen et al. (2010) were quite good with a slight overestimation of the 1960-present sampling uncertainty. In particular, we find nearly exact agreement over 1880–1900. The sampling uncertainty analysis also suggests that the GISTEMP annual mean time series may be extended to dates earlier than 1880 as is done in HadCRUT4 and Berkeley Earth, but not without suffering a large increase in sampling uncertainty, particularly if including data prior to 1870.

Separating the land uncertainty by hemisphere, we find that the Southern Hemisphere has greater sampling uncertainty due to the smaller proportion of land with station coverage (Figure 4). We again see the effect of Antarctica on the Southern Hemisphere through the reduction in sampling uncertainty from 1950–1970. The hemispheric homogenization uncertainties decrease slowly, as does the uncertainty in the global mean, with the exception of the large jump in Southern Hemisphere uncertainty in the mid-1920s, which can be explained by the limited number of stations in the Southern Hemisphere available for comparison.

We further break down the sampling uncertainty analysis to the GISTEMP band level to determine the latitudinal regions where the station record may be unreliable. Figure 5 shows the time series for each of the eight latitudinal bands used in the GISTEMP analysis. The polar series confirm that these regions are driving the decrease in sampling uncertainty in both hemispheres.

Combining our improved total LSAT uncertainty with the GISTEMP land surface temperature time series gives an intuitive description of the certainty of the land warming trend over the modern record period. Figure 6 shows the LSAT time series from the operational GISTEMP analysis with confidence intervals according to the sampling and homogenization uncertainties. The magnitude of the trend is many times greater than the uncertainty at any period. Additionally, the uncertainty is much lower over the 1960 to present period in which much of the warming has occurred.

### 6.2 LSAT Extensions Results

#### 6.2.1 Sampling Bias Results

Since the results of the sampling bias assessment were not robust among reanalyses, we present the results for all three reanalyses in Figure 7. In general, the JRA55 and ERA5 products agree, with MERRA being an outlier. We find weak evidence for a warm bias due to sampling for the in-sample 1980 to present time period when using the ERA analysis. ERA5 also has the smallest confidence intervals of the three analyses, demonstrating that the nonsignificance of the bias is a robust result.

As mentioned, the major caveat in the bias calculation is that the climate has been highly nonstationary over the past 150 years and we are calculating the bias due to a particular incomplete sampling using the climate changes over the ERA period of 1979–2018. That is, we are determining how well a particular station arrangement could observe the climate change that occurred from 1979–2018, a period in which we believe that the Arctic is warming faster than the rest of the land. In addition, we are making the assumption that the Arctic temperature is changing at a fixed multiple of the global average. This assumption is reasonable, as model studies have shown that modeling the amplification trend linearly is a reasonable choice over recent decades (Serreze & Barry, 2011; Cohen et al., 2014).

#### 6.2.2 Limiting Uncertainty Results

Running the sampling uncertainty analysis assuming perfect coverage suggests that 0.041 °C is the limiting sampling 95% confidence interval for the annual mean LSAT anomaly in the GISTEMP method. In other words, adding additional station observations will not reduce the sampling uncertainty below this level. The current coverage is already quite close to this value as shown in Figure 8, implying that we are close to the limiting coverage for the GISTEMP model. Roughly speaking, the limiting uncertainty decreases with the amount of smoothing in the interpolation. As station coverage continues to improve, the choice of interpolation in GISTEMP should be revisited.

Our limiting sampling bias is found to be significant, albeit small. We find that the GISTEMP procedure overestimates the true global mean LSAT over the ERA5 record by 1.1% with a 95% confidence interval of (0.7%, 1.5%). A small limiting bias again suggests a reduction in the smoothing radius as full coverage is approached. In the context of the results in the previous section, we interpret production GISTEMP as being nearly unbiased, even in the pathological limiting case.

#### 6.2.3 Averaging Method Comparison Results

Figure 8 compares the LSAT sampling uncertainty from the simple latitude-weighted mean and GISTEMP band mean methods. The simple mean obtains sampling error comparable to the GISTEMP method over the global land surface from the 1870s onward. Both methods approach the limiting LSAT uncertainty post-1960 as nearly global coverage is achieved. The result compares the performance of the GISTEMP averaging to the simple method over the land surface, but does not account for the difference in total uncertainty when large portions of SST data are missing, such as much of the record in the rapidly warming Arctic region.

### 6.3 Ocean

The global uncertainty from the ERSST large ensemble using the GISTEMP averaging scheme resembles the global uncertainty calculated by the ERSST team. Similar uncertainty is expected as the GISTEMP averaging scheme converges to a latitude-weighted grid cell average as missing data approaches zero and the ERSST large ensemble has complete coverage of the oceans. The GISTEMP operational global annual average SST time series is shown in Figure 9. As in the LSAT global time series, the magnitude of the warming trend dominates the uncertainty of the calculation.

Looking at the hemispheric uncertainty in the annual SST anomaly, we see that there are minor differences between the two hemispheres (Figure 10). The larger uncertainty in the Southern Hemisphere post-1945 drives the global uncertainty as the Southern Hemisphere has double the area occupied by ocean compared to the Northern Hemisphere.

### 6.4 Total Global Uncertainty

We are now able to combine our total global uncertainty with the production GISTEMP global annual mean surface temperature anomaly time series. Figure 11 shows the production GISTEMP global time series with the 95% confidence interval calculated in this study. The confidence interval has been added to the distributed GISTEMP time series facilitating uncertainty quantification in studies that utilize the GISTEMP product. As in both the SST and LSAT time series, the warming signal is greater than the underlying uncertainty. We investigate the possible uncertainty of the signal in the following section.

As in the land and ocean analyses, we decompose the global uncertainty into Northern Hemisphere and Southern Hemisphere uncertainties (Figure 12). Following the larger land uncertainty and comparable ocean uncertainty, we see that the total uncertainty on the annual hemispheric mean is almost always larger in the Southern Hemisphere.

## 7 Discussion

Since the first GISTEMP estimates in the 1980s, there have been large increases in the amount of data ingested, improvements in the homogenization of station data to remove nonclimatic effects, and the incorporation of ocean data, but not much change to the global mean calculation methodology. These data changes have produced variations over time of the global annual mean record that, while not a controlled exploration, are indicative of the structural uncertainties in the product that arise indirectly through changes in data availability and processing. The new analysis presented here is far more complete, but it is appropriate that recent versions of GISTEMP fall within the uncertainties shown in Figure 11.

The improved assessment of uncertainty in the GISTEMP product is a function of three new developments: the Monte Carlo ensembles that have been done for the input data (ERSST and GHCN), the upgrading of the GISTEMP code base, and the evolving standards in uncertainty quantification in climate science. These threads have made the current study far more tractable than it would have been a decade ago.

The existence of the new uncertainty product now allows us to be more rigorous in assessing the strength of claims of records and trends in the data itself, but also to improve the propagation of that uncertainty into, for instance, detection and attribution exercises for constraining anthropogenic climate change.

One persistent question is whether it makes sense to extend the GISTEMP product prior to 1880, to perhaps as early as 1850 (for instance, to help estimate a nineteenth century baseline climatology; Hawkins et al., 2017). Figure 2 demonstrates that the sampling in the 1870s is not that much worse than in the 1880s, but unfortunately, the homogenization analysis does not extend before 1880, nor does the ERSST data. This is an issue we will continue to explore.

### 7.1 Probability of a New Warmest Year Record

The addition of the global annual mean uncertainty values calculated in this study to the widely distributed GISTEMP surface temperature product will enable users to include more informed probabilistic statements of uncertainty in their research. One such example is the probability of warmest year calculation which is often cited in scientific and popular literature.

Given the strong trend in global mean temperatures since the 1970s, NASA/GISS has frequently reported on new records for annual means over the instrumental period (11 times since 1988). This naturally leads to the question of how confident we can be in declaring that any particular record year in the GISTEMP index, was, in fact, the warmest year in the real world since 1880. Discussion of this uncertainty has been a focus of the NOAA and NASA annual briefings since 2014, which at the time was the warmest year in the record (NASA Public Affairs, 2015). With the major El Niño event in 2015/2016, both subsequent years were notably warmer (NASA Public Affairs, 2016, 2017), but how certain can we be of that?

We make a Monte Carlo estimate of the warmest year by determining which year has the highest temperature anomaly after either independent or autoregressive simulations of the uncertainties. The probability that a given year was the warmest year on record to date is then the number of simulations in which it is the warmest year divided by the total number of simulations. We use this method to reassess how well NASA's recent statements on the probability of warmest years match up to our updated uncertainty calculations.
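Under the independence assumption, the Monte Carlo can be sketched as follows (the anomaly values and uncertainties below are invented for illustration, not the GISTEMP numbers):

```python
import numpy as np

# Perturb each annual mean by its own uncertainty and count how often each
# year comes out on top across simulations.
rng = np.random.default_rng(4)
anoms = np.array([1.00, 0.92, 0.85, 0.89])     # candidate years' means (toy)
sigmas = np.array([0.025, 0.025, 0.025, 0.030])  # their 1-sigma uncertainties

n_sim = 100_000
draws = rng.normal(anoms, sigmas, size=(n_sim, 4))
winners = draws.argmax(axis=1)                 # index of warmest year per draw
p_record = np.bincount(winners, minlength=4) / n_sim
```

Note how a year with a slightly lower mean but larger uncertainty (the last entry) can still claim a nonzero share of the record probability, as discussed for 2018 below.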

In January 2015, NASA reported that 2014 was likely the warmest year with 38% likelihood (NASA Public Affairs, 2015) based on a simple assumption of linearly increasing uncertainty based on the Hansen et al. (2010) estimates. We now find that this was conservative and that 2014 actually had a 79% chance of truly being the warmest year in the instrumental period. Assuming autocorrelated uncertainties, this reduces slightly to 75% since the next most probable warmest years were nonconsecutive (2010 and 2005). The following year, NASA reported that the likelihood of 2015 being the new record warmest year was 96%, which compares to a 99.9% probability calculated now (regardless of whether we use independent or autocorrelated uncertainties).

Assuming that uncertainties in the annual mean are independent from year to year, we find that 2016 is likely the warmest year in the last 139 (1880–2018) years with 87% certainty. The other years that could plausibly have been the warmest were 2017 (12% probability), 2018 (1% probability), and 2015 (<0.1% probability). While the GISTEMP-estimated mean global temperature is larger in 2015 than in 2018, the uncertainty in the 2018 mean is larger, primarily due to an increase in the LSAT homogenization uncertainty. Therefore, 2015 will rank higher on the list of warmest years than 2018 on average, but the additional uncertainty in the 2018 mean gives it a greater chance of being the warmest year.

We can also calculate this probability using autoregressive uncertainties. Unlike the uncertainty in temperature change, autoregressive uncertainties give more certainty to 2016 being the warmest year, with a simulated 88% certainty. Since all of the candidate years are consecutive, positive autocorrelation reduces the expected difference in their uncertainties.
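A hedged sketch of generating AR(1) errors whose marginal standard deviation is preserved; the lag-1 coefficient of 0.5 is illustrative, not the value used in the study:

```python
import numpy as np

# AR(1) error simulation: e_t = phi * e_{t-1} + innovation, with the
# innovation variance chosen so each year keeps marginal sd = sigma.
rng = np.random.default_rng(5)

def ar1_errors(n_years, n_sim, sigma, phi=0.5):
    eps = np.zeros((n_sim, n_years))
    innov_sd = sigma * np.sqrt(1 - phi**2)       # keeps marginal sd = sigma
    eps[:, 0] = rng.normal(0, sigma, n_sim)
    for t in range(1, n_years):
        eps[:, t] = phi * eps[:, t - 1] + rng.normal(0, innov_sd, n_sim)
    return eps

errs = ar1_errors(4, 50_000, 0.025)
```

Adding correlated errors of this form to consecutive candidate years shifts them together, which is why positive autocorrelation shrinks the effective spread between their perturbed values.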

While the AR(1) calculation is a reasonable choice for comparing anomalies over a short time period, such a calculation is not statistically sound for longer-term analyses using the uncertainties calculated in our study. Components of the uncertainty, particularly the homogenization uncertainty, persist over many decades reflecting large shifts in the record that propagate in time. These types of uncertainties are best represented in an uncertainty ensemble which has not yet been created for GISTEMP.

### 7.2 Comparison to Other Uncertainty Estimates

Two of the other products shown in Figure 1 have independently derived total uncertainties, specifically HadCRUT4 and Berkeley Earth. Figure 13 shows the comparison of the three 95% confidence intervals. The overall magnitudes are similar, with close agreement with the HadCRUT4 uncertainty pre-1945 and with Berkeley Earth post-1945. The character of the change around 1945 is driven primarily by the reduction in SST uncertainty in ERSST and the greater reduction in GISTEMP LSAT sampling uncertainty relative to HadCRUT4.

## 8 Conclusion

Our new uncertainty quantification of the global annual mean surface temperature anomaly in the GISTEMP product brings this analysis up to the enhanced standards of its peers and we hope that this will aid the interpretation and utility of this widely used product. This paper has focused on the global and hemispheric annual means, but the procedure can equally be used to improve the uncertainty analysis of regional and monthly data products and these will be pursued in further work.

## Acknowledgments

We thank both the reviewers and Editor for the insightful comments. Their suggested extensions led to interesting discoveries and a much stronger analysis and paper. The GISTEMP analysis is funded from grants from the NASA Modeling, Analysis and Prediction program in the Science Mission Directorate. N. L. was also supported by the National Science Foundation Graduate Research Fellowship under Grant NSF DGE 16-44869. Data sets from GHCN and ERSST are supported by NOAA's National Centers for Environmental Information. The Antarctic READER data are supported by SCAR. Thanks to Boyin Huang (NOAA) for access to the ERSST large parameter ensemble. Special thanks to the ClearClimateCode project, Nick Barnes and David R. Jones, for converting the original GISTEMP codebase to Python. The analysis was performed in the open source language R (R Core Team, 2016) and the data, code, and intermediate steps needed to generate all figures in this report are available on the GISTEMP website (https://data.giss.nasa.gov/gistemp/uncertainty).

## References

## Erratum

In the originally published version of this article, equation (7) erroneously published as “*μ*(*t*) = *α*_{k} + *β*_{k}*A*_{k}(*t*) + *ɛ*_{k}(*t*)” instead of “*A*_{k}(*t*) = *α*_{k} + *β*_{k}*μ*(*t*) + *ɛ*_{k}(*t*).” As a result, Figure 7 published incorrectly, and the following changes were made to the text:

The first sentence in the third paragraph of 6.2.1 was changed from “The large and significant cool biases in the ERA and JRA reanalyses in the early record describe how undersampling the observed 1979 to present temperature change would lead to a biased calculation in the global mean” to “The robust large and significant cool biases in the early record describe how undersampling the observed 1979 to present temperature change would lead to a biased calculation in the global mean.”

The second sentence of the final paragraph of 6.2.2 was changed from “We find that the GISTEMP procedure overestimates the true global mean LSAT over the ERA5 record by 1.5% with a 95% confidence interval of (1.0%, 2.0%)” to “We find that the GISTEMP procedure overestimates the true global mean LSAT over the ERA5 record by 1.2% with a 95% confidence interval of (0.7%, 1.7%).”

In addition, an error was discovered by Dr. Alexandre Patriota in Functions.R that was used to process the data used in the originally published version of this article. Specifically, there was a coding error affecting the Southern Hemisphere LSAT sampling error calculation, and a correction was made to the weighting used to compute the LSAT global mean. The authors have corrected this error and a description of the revisions is below. The conclusions of the paper are unaffected. The codebase and output on the public GISTEMP website will be corrected alongside full documentation of the changes to the results.

In the abstract, the sentence “We use the total uncertainties to estimate the probability for each record year in the GISTEMP to actually be the true record year (to that date) and conclude with 86% likelihood that 2016 was indeed the hottest year of the instrumental period (so far)” was changed to “We use the total uncertainties to estimate the probability for each record year in the GISTEMP to actually be the true record year (to that date) and conclude with 87% likelihood that 2016 was indeed the hottest year of the instrumental period (so far).”

In the third paragraph of section 6.1, “The ERA5 analysis shows that the uncertainties in Hansen et al. (2010) were quite good for the early record but overestimate the sampling uncertainty post-1950. In particular, we find nearly exact agreement over 1880–1900” was changed to “The ERA5 analysis shows that the uncertainties in Hansen et al. (2010) were quite good with a slight overestimation of the 1960-present sampling uncertainty.”

In the fourth paragraph of section 6.1, “Separating the land uncertainty by hemisphere, we find that the Southern Hemisphere has greater sampling uncertainty post-1920 coinciding with improved Northern Hemispheric coverage of the Arctic land and sea ice region (Figure 4)” was changed to “Separating the land uncertainty by hemisphere, we find that the Southern Hemisphere has greater sampling uncertainty due to the smaller proportion of land with station coverage (Figure 4).”

In the first paragraph of section 6.2.1, “We find no evidence for sampling bias for the in-sample 1980 to present time period when using the ERA analysis” was changed to “We find weak evidence for a warm bias due to sampling for the in-sample 1980 to present time period when using the ERA analysis.”

The last paragraph of section 6.2.1 was deleted. It read “The robust large and significant cool biases in the early record describe how undersampling the observed 1979 to present temperature change would lead to a biased calculation in the global mean. The approach of the estimates to unbiased mirrors the global coverage shown in Figure 2. The relationship between coverage and bias in estimating the 1979 to present warming makes sense, particularly because we know that station coverage in polar regions was limited or nonexistent in the early record and arctic temperature changed more rapidly over the past few decades.”

In the first paragraph of section 6.2.2, “Running the sampling uncertainty analysis assuming perfect coverage suggests that 0.034 C is the limiting sampling 95% confidence interval for the annual mean temperature anomaly in the GISTEMP method” was changed to “Running the sampling uncertainty analysis assuming perfect coverage suggests that 0.041 C is the limiting sampling 95% confidence interval for the annual mean LSAT anomaly in the GISTEMP method.”

In the second paragraph of section 6.2.2, “We find that the GISTEMP procedure overestimates the true global mean LSAT over the ERA5 record by 1.2% with a 95% confidence interval of (0.7%, 1.7%)” was changed to “We find that the GISTEMP procedure overestimates the true global mean LSAT over the ERA5 record by 1.1% with a 95% confidence interval of (0.7%, 1.5%).”

The paragraph in section 6.2.3 was changed from “Figure 8 compares the LSAT sampling uncertainty from the simple latitude-weighted mean and GISTEMP band mean methods. We find that the GISTEMP method almost always outperforms the simple method with the 1890s and 1900s being the only exceptions. Furthermore, we see that the GISTEMP method outperforms the simple method by up to 50% in the 1930s and 1940s, primarily due to the added arctic coverage providing better NH polar band estimates. The results demonstrate the value added by the GISTEMP averaging scheme leveraging the zonal correlation of temperature anomalies” to “Figure 8 compares the LSAT sampling uncertainty from the simple latitude-weighted mean and GISTEMP band mean methods. The simple mean obtains comparable sampling error over the global land surface than the GISTEMP method from the 1870s onward. Both methods approach the limiting LSAT uncertainty post-1960 as nearly global coverage is achieved. The result compares the performance of the GISTEMP averaging to the simple method over the land surface, but does not account for the difference in total uncertainty when large portions of SST data are missing such as much of the record in the rapidly warming arctic region.”

The fourth paragraph of section 7.1 was changed from “We now find that this was conservative and that 2014 actually had a 79% chance of truly being the warmest year in the instrumental period. Assuming autocorrelated uncertainties, this reduces slightly to 75% since the next most probable warmest years were nonconsecutive (2010 and 2005). The following year, NASA reported a 96% likelihood that 2015 was the new record warmest year, which compares to a 99.99% probability calculated now (regardless of whether we use independent or autocorrelated uncertainties)” to “We now find that this was conservative and that 2014 actually had a 79% chance of truly being the warmest year in the instrumental period. Assuming autocorrelated uncertainties, this reduces slightly to 75% since the next most probable warmest years were nonconsecutive (2010 and 2005). The following year, NASA reported a 96% likelihood that 2015 was the new record warmest year, which compares to a 99.9% probability calculated now (regardless of whether we use independent or autocorrelated uncertainties).”

In the fifth paragraph of section 7.1, “Assuming that uncertainties in the annual mean are independent from year to year, we find that 2016 is likely the warmest year in the last 139 (1880–2018) years with 86.2% certainty. The other years that could plausibly have been the warmest were 2017 (12.5% probability), 2018 (1.2% probability), and 2015 (<0.1% probability)” was changed to “Assuming that uncertainties in the annual mean are independent from year to year, we find that 2016 is likely the warmest year in the last 139 (1880–2018) years with 87% certainty. The other years that could plausibly have been the warmest were 2017 (12% probability), 2018 (1% probability), and 2015 (<0.1% probability).”

In the sixth paragraph of section 7.1, “Unlike the uncertainty in temperature change, autoregressive uncertainties give more certainty to 2016 being the warmest year with a simulated 87.2% certainty” was changed to “Unlike the uncertainty in temperature change, autoregressive uncertainties give more certainty to 2016 being the warmest year with a simulated 88% certainty.”
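The record-year probabilities discussed above can be estimated by Monte Carlo sampling: draw plausible “true” annual anomalies from each year's uncertainty distribution and count how often each year comes out on top. The sketch below illustrates the independent-uncertainty case only; the anomaly and uncertainty values are placeholders for illustration, not the published GISTEMP numbers, and the published analysis also treats the autocorrelated case.

```python
import numpy as np

# Illustrative annual global mean anomalies (°C) and 1-sigma uncertainties;
# these values are hypothetical, not the published GISTEMP estimates.
years = np.array([2014, 2015, 2016, 2017, 2018])
anomaly = np.array([0.74, 0.90, 1.01, 0.92, 0.85])
sigma = np.full(5, 0.025)  # roughly a 0.05 °C 95% confidence interval

rng = np.random.default_rng(0)
n_draws = 100_000

# Independent-uncertainty case: sample each year's "true" anomaly
# independently, then count how often each year is the maximum.
draws = rng.normal(anomaly, sigma, size=(n_draws, len(years)))
winners = years[np.argmax(draws, axis=1)]
prob = {int(y): float(np.mean(winners == y)) for y in years}
```

Under autocorrelated uncertainties the draws for neighboring years are no longer independent, which is why the paper reports slightly different probabilities for the two cases.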

The following figures were corrected (all that include SH sampling uncertainty): Figures 3, 4, 6, 7, 8, 9, 10, 11, 12, 13.

The text and figures have been corrected, and this may be considered the authoritative version of record.