# Estimating error cross-correlations in soil moisture data sets using extended collocation analysis

## Abstract

Global soil moisture records are essential for studying the role of hydrologic processes within the larger earth system. Various studies have shown the benefit of assimilating satellite-based soil moisture data into water balance models or merging multisource soil moisture retrievals into a unified data set. However, this requires an appropriate parameterization of the error structures of the underlying data sets. While triple collocation (TC) analysis has been widely recognized as a powerful tool for estimating random error variances of coarse-resolution soil moisture data sets, the estimation of error cross covariances remains an unresolved challenge. Here we propose a method, referred to as extended collocation (EC) analysis, for estimating error cross-correlations by generalizing the TC method to an arbitrary number of data sets and relaxing its assumption of zero error cross-correlation for certain data set combinations. A synthetic experiment shows that EC analysis is able to reliably recover true error cross-correlation levels. Applied to real soil moisture retrievals from Advanced Microwave Scanning Radiometer-EOS (AMSR-E) C-band and X-band observations together with advanced scatterometer (ASCAT) retrievals, modeled data from Global Land Data Assimilation System (GLDAS)-Noah, and in situ measurements drawn from the International Soil Moisture Network, EC yields reasonable, strong nonzero error cross-correlations between the two AMSR-E products. Against expectation, nonzero error cross-correlations are also found between ASCAT and AMSR-E. We conclude that the proposed EC method represents an important step toward a fully parameterized error covariance matrix for coarse-resolution soil moisture data sets, which is vital for any rigorous data assimilation framework or data merging scheme.

## Key Points

- Triple collocation analysis is extended to an arbitrary number of data sets
- Extended collocation analysis allows for the estimation of error cross-correlations
- The method is evaluated using synthetic and real data

## 1 Introduction

Consistent global soil moisture records are essential for studying hydrology-driven phenomena of the Earth system such as climate change, vegetation growth, and many others [*Legates et al.*, 2011]. Various studies have shown the benefit of blending satellite-based soil moisture observations from multiple platforms into a unified data set [*Liu et al.*, 2011, 2012] or assimilating them into water balance models in order to generate a continuous merged (model/remote sensing) soil moisture analysis product [*Bolten and Crow*, 2012; *de Rosnay et al.*, 2013]. However, such merging and assimilation frameworks require an appropriate statistical parameterization of the error structures of both the land surface model and the remote sensing data, which is often difficult to obtain in practice. This error parameterization problem becomes even more challenging if errors between different input data sets are correlated as this requires the parameterization of error covariances (i.e., the off-diagonal elements of the error covariance matrix) in addition to error variances (i.e., the diagonal elements of the error covariance matrix).

In the past, off-diagonal elements in the error covariance matrix were commonly neglected as there was no method available for reliably estimating these elements [*Yilmaz et al.*, 2012]. At the same time, the increasing simultaneous availability of various active and passive satellite-based sensors (e.g., advanced scatterometer (ASCAT) onboard MetOp-A and MetOp-B, Soil Moisture Active Passive (SMAP), Soil Moisture Ocean Salinity (SMOS), Advanced Microwave Scanning Radiometer-EOS (AMSR-E), and AMSR2) inevitably leads to the need for a fully parameterized error covariance matrix, which is vital for any statistically rigorous attempt to merge multisource soil moisture retrievals into a unified data set [*Crow et al.*, 2015].

Triple collocation (TC) analysis [*Stoffelen*, 1998] has been widely recognized as a powerful tool for parameterizing the diagonal elements of the error covariance matrix [*Crow and Van den Berg*, 2010]. A first attempt to additionally estimate off-diagonal elements of the error covariance matrix was made by *Crow and Yilmaz* [2014] who analytically combined TC analysis with Kalman filter innovation analysis—referred to as Auto-Tuned Land Data Assimilation System (ATLAS)—yet the stability of the thereby obtained error cross-covariance estimates has not been proven over larger scales. More recently, *Crow et al.* [2015] proposed a TC-based approach to estimate off-diagonal elements by using lagged variables (i.e., temporally shifted representations of a particular data set) [*Su et al.*, 2014a] to generate data set triplets with uncorrelated errors, which can also provide consistent error variance estimates. Subtracting these estimates from error variance estimates obtained from a triplet using the corresponding data set together with two data sets that have correlated errors then yields an estimate of their error covariance. However, error cross-covariance estimates produced by this technique can become biased in the presence of temporal error auto-correlation [*Crow et al.*, 2015]. Another extension of TC that also tolerates the existence of nonzero error cross-correlations when using more than three data sets for the collocation was proposed by *Pan et al.* [2015]. It solves the collocation problem through Pythagorean constraints in Hilbert space, yet it does not yield estimates for nonzero error cross-correlations. Instead, it splits all considered data sets into so-called structural groups, within which the data sets are likely to have correlated errors. 
Random error variances of each data set in each group are then estimated as two components: one part that is correlated with the errors of the other data sets (within the same group), and a remaining part that is entirely independent of all other data sets (within all groups). Summing these two components yields estimates of the total error variance of each data set.

Here we propose an alternative method for estimating error cross-correlations by generalizing TC analysis to an arbitrary number of *N* > 3 data sets following *Zwieback et al.* [2012] and relaxing the assumption of zero error cross-correlation for a limited number of data set combinations. The resulting method is referred to as extended collocation (EC) analysis and allows for the estimation of a limited number of nonzero error cross-correlations—in addition to error variance and scaling coefficient estimates for all considered data sets—depending on the number of data sets used and their assumed underlying error structure. Of particular importance will be the estimation of error cross-correlation among different active satellite-based data sets (e.g., MetOp-A and MetOp-B ASCAT), among passive satellite-based data sets (e.g., SMOS, AMSR2, and WindSat), among data sets derived from the same sensor using different retrieval algorithms (e.g., SMOS L3 and SMOS LPRM), and among land surface models with similar atmospheric forcing (e.g., ERA-Land and Global Land Data Assimilation System (GLDAS)-Noah), all of which are simultaneously resolvable in the EC analysis framework.

For simplicity and without any loss of generality, the method will be discussed and demonstrated using at most five data sets. Note that *Pierdicca et al.* [2015] recently proposed to extend TC analysis with a fourth data set and to solve this quadruple collocation (QC) problem as an overdetermined system of three possible triplets in a least squares sense. This minimizes the uncertainty of the individual error estimates but still requires uncorrelated errors between all four data sets. For the EC method proposed here we follow *Pierdicca et al.* [2015] in solving the collocation system of equations in a least squares sense in cases where the system remains overdetermined after some degrees of freedom have been used to estimate further parameters (i.e., error cross-correlations). It is worth mentioning that even though only soil moisture data sets are considered in this study, EC is, just like TC, also applicable to other geophysical variables in hydrometeorology and oceanography [e.g., *Vogelzang et al.*, 2011; *Caires and Sterl*, 2003; *Roebeling et al.*, 2012; *Fang et al.*, 2012].

The method will be derived in section 2. Section 3 shows an evaluation of the method using both synthetic identical twin experiments and a real data experiment.

## 2 Background

Our proposed EC method is a generalization of the well-known triple collocation (TC) analysis [*Stoffelen*, 1998], which is commonly used for estimating the individual error variances of three spatiotemporally collocated soil moisture data sets with mutually uncorrelated random errors [*Scipal et al.*, 2008; *Dorigo et al.*, 2010]. In the following sections we will derive the estimators using the so-called covariance notation for the collocation problem [*Stoffelen*, 1998; *Su et al.*, 2014b; *Gruber et al.*, 2015].

### 2.1 Triple Collocation

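TC analysis is based on the following error model:

$$i = \alpha_i + \beta_i \Theta + \epsilon_i \qquad (1)$$

with $i \in [a, b, c]$ representing three spatially and temporally collocated soil moisture data sets, where $\Theta$ is the true soil moisture state; $\alpha_i$ and $\beta_i$ are additive and multiplicative biases in data set $i$, and $\epsilon_i$ is zero-mean random noise. By using the error model in equation (1), the data set variances and covariances can be written as

$$\begin{aligned}
\sigma_i^2 &= \beta_i^2 \sigma_\Theta^2 + 2\beta_i \sigma_{\Theta\epsilon_i} + \sigma_{\epsilon_i}^2 \\
\sigma_{ij} &= \beta_i \beta_j \sigma_\Theta^2 + \beta_i \sigma_{\Theta\epsilon_j} + \beta_j \sigma_{\Theta\epsilon_i} + \sigma_{\epsilon_i\epsilon_j}
\end{aligned} \qquad (2)$$

with $i, j \in [a, b, c]$. TC analysis assumes error orthogonality ($\sigma_{\Theta\epsilon_i} = 0$) and zero error cross-correlation ($\sigma_{\epsilon_i\epsilon_j} = 0$ for $i \neq j$). Equation (2) thus simplifies to

$$\begin{aligned}
\sigma_i^2 &= \beta_i^2 \sigma_\Theta^2 + \sigma_{\epsilon_i}^2 \\
\sigma_{ij} &= \beta_i \beta_j \sigma_\Theta^2
\end{aligned} \qquad (3)$$

Solving equation (3) for the signal and error variances yields

$$\begin{aligned}
\beta_i^2 \sigma_\Theta^2 &= \frac{\sigma_{ij}\sigma_{ik}}{\sigma_{jk}} \\
\sigma_{\epsilon_i}^2 &= \sigma_i^2 - \frac{\sigma_{ij}\sigma_{ik}}{\sigma_{jk}}
\end{aligned} \qquad (4)$$

with $i, j, k \in [a, b, c]$ and $i \neq j \neq k$. These are the final error estimates obtained from TC analysis, which allow for either a direct investigation of the error variances ($\sigma_{\epsilon_i}^2$), or for an investigation of the signal-to-noise ratios ($\mathrm{SNR}_i = \beta_i^2 \sigma_\Theta^2 / \sigma_{\epsilon_i}^2$) of the data sets. However, these estimates become biased in the presence of nonzero error cross-correlations and/or nonorthogonal errors, as can be seen in equation (2). The first quantitative investigation of such biases due to violations of the TC assumptions was recently made by *Yilmaz and Crow* [2014]. While the results of that study suggest that both nonzero error cross-correlation and nonorthogonal errors may exist in typical soil moisture data sets, it was also found that the impact of error cross-correlations is of greater importance than that of error nonorthogonalities. This is because the impact of the latter can be dampened or even compensated if their magnitude is approximately equal for all data sets and also because errors of different data sets that are nonorthogonal are typically also cross correlated.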
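The TC estimators described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `triple_collocation`, the synthetic test series, and all numeric values are assumptions made for demonstration, and the SNR is converted to decibels as used later in the synthetic experiment.

```python
import numpy as np

def triple_collocation(a, b, c):
    """Estimate signal variance, error variance, and SNR [dB] for three series."""
    cov = np.cov(np.vstack([a, b, c]))  # 3x3 sample covariance matrix
    signal = np.empty(3)
    # For data set i with partners j and k:
    # beta_i^2 * var(theta) = cov_ij * cov_ik / cov_jk
    for i, j, k in [(0, 1, 2), (1, 0, 2), (2, 0, 1)]:
        signal[i] = cov[i, j] * cov[i, k] / cov[j, k]
    error = np.diag(cov) - signal             # total variance minus scaled signal variance
    snr_db = 10.0 * np.log10(signal / error)  # signal-to-noise ratio in decibels
    return signal, error, snr_db

# Hypothetical test data: one truth and three data sets with independent errors
rng = np.random.default_rng(0)
n = 1_000_000
truth = rng.normal(0.0, 10.0, n)                   # signal variance 100
a = 2.0 + 1.0 * truth + rng.normal(0.0, 5.0, n)    # error variance 25
b = -1.0 + 0.8 * truth + rng.normal(0.0, 8.0, n)   # error variance 64
c = 0.5 + 1.2 * truth + rng.normal(0.0, 6.0, n)    # error variance 36
signal_var, error_var, snr_db = triple_collocation(a, b, c)
```

Because the three error series are generated independently, the estimated error variances should approach the prescribed values as the sample size grows. Introducing a correlation between two of the error series would bias them, which is the situation the EC method addresses.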

### 2.2 Extended Collocation Problem

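We now generalize TC analysis to $N$ data sets [*Zwieback et al.*, 2012] and relax the assumption of zero error cross-correlation for some data set combinations while maintaining the assumption of orthogonal errors for all data sets. According to equation (2), the data set covariances can then be written as

$$\sigma_{ij} = \beta_i \beta_j \sigma_\Theta^2 + \sigma_{\epsilon_i\epsilon_j} \qquad (5)$$

with $i \neq j$. Cross covariances between errors in $i$ and $j$ can then be directly estimated from equation (5) as

$$\sigma_{\epsilon_i\epsilon_j} = \sigma_{ij} - \frac{\sigma_{ik}\sigma_{jl}}{\sigma_{kl}} \qquad (6)$$

with $i \neq j \neq k \neq l$, where $\sigma_{\epsilon_i\epsilon_k}$, $\sigma_{\epsilon_j\epsilon_l}$, and $\sigma_{\epsilon_k\epsilon_l}$ are required to be zero. Error cross-correlations can be further derived by dividing equation (6) by the product of the error standard deviations of the two data sets, obtained by applying equation (4) to data set triplets with mutually uncorrelated errors, provided that such triplets are available (see section 2.4).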
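The conversion from an error cross-covariance to an error cross-correlation can be illustrated as follows. This is a sketch under stated assumptions: the function name `error_cross_correlation`, the index convention (columns a, b, c, d), and the synthetic numbers are all hypothetical, and only data sets a and b are given correlated errors.

```python
import numpy as np

def error_cross_correlation(cov, i, j, k, l):
    """Error cross-correlation of i and j, using k and l as independent references."""
    # Error cross-covariance: cov_ij - cov_ik * cov_jl / cov_kl
    err_cov = cov[i, j] - cov[i, k] * cov[j, l] / cov[k, l]
    # Error variances from the triplets (i, k, l) and (j, k, l),
    # whose errors are assumed to be mutually uncorrelated
    err_var_i = cov[i, i] - cov[i, k] * cov[i, l] / cov[k, l]
    err_var_j = cov[j, j] - cov[j, k] * cov[j, l] / cov[k, l]
    return err_cov / np.sqrt(err_var_i * err_var_j)

rng = np.random.default_rng(1)
n = 500_000
truth = rng.normal(0.0, 10.0, n)
err_sd = np.array([6.0, 7.0, 5.0, 8.0])
err_corr = np.eye(4)
err_corr[0, 1] = err_corr[1, 0] = 0.6            # only a and b share correlated errors
err_cov = err_corr * np.outer(err_sd, err_sd)    # error covariance matrix
noise = rng.multivariate_normal(np.zeros(4), err_cov, size=n)
beta = np.array([1.0, 0.9, 1.1, 0.8])            # multiplicative biases
data = truth[:, None] * beta + noise             # columns: a, b, c, d
cov = np.cov(data, rowvar=False)
rho_ab = error_cross_correlation(cov, 0, 1, 2, 3)
```

With a large sample, `rho_ab` should lie close to the prescribed value of 0.6. Swapping the roles of `k` and `l` gives a second, redundant estimate of the same quantity.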

Notice that equation (6) uses a combination of exactly four different covariances (between four data set pairs), three of which are required to have uncorrelated errors. However, the availability of four data sets already provides six possible data set pairs (i.e., six different covariances), and this number increases with the number of data sets ($N$) as $\binom{N}{2} = N(N-1)/2$. Therefore, we can typically define a certain number of redundant estimators for $\sigma_{\epsilon_i\epsilon_j}$. The same holds for the signal and error variance estimates, i.e., for the $\beta_i^2\sigma_\Theta^2$ and $\sigma_{\epsilon_i}^2$ obtained from equation (4), which require three data set pairs, all of which must have uncorrelated errors. This redundancy allows us to solve the EC problem in a least squares sense in order to reduce estimation uncertainties in the error variance and covariance estimates [*Su et al.*, 2014a; *Pierdicca et al.*, 2015].
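The redundancy argument can be made concrete by counting. The sketch below is illustrative only: the helper `redundant_estimators` is a hypothetical name, and it simply enumerates the ordered reference pairs (k, l) that satisfy the zero-correlation requirements of the cross-covariance estimator for a given correlated pair (i, j).

```python
from itertools import permutations
from math import comb

def redundant_estimators(names, correlated):
    """Count valid (k, l) reference choices for each correlated pair (i, j)."""
    corr = {frozenset(pair) for pair in correlated}
    uncorrelated = lambda x, y: frozenset((x, y)) not in corr
    counts = {}
    for i, j in correlated:
        others = [name for name in names if name not in (i, j)]
        # The estimator needs uncorrelated errors for (i, k), (j, l), and (k, l)
        counts[(i, j)] = sum(
            1 for k, l in permutations(others, 2)
            if uncorrelated(i, k) and uncorrelated(j, l) and uncorrelated(k, l)
        )
    return counts

pairs_for_4 = comb(4, 2)   # number of data set pairs for N = 4
pairs_for_5 = comb(5, 2)   # number of data set pairs for N = 5
qc_counts = redundant_estimators("abcd", [("a", "b")])
five_counts = redundant_estimators("abcde", [("a", "b")])
```

For four data sets with one correlated pair this yields two estimators of the same error cross-covariance, and adding a fifth independent data set raises the count to six, which is why additional data sets increase the degrees of freedom.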

### 2.3 Least Squares Solution

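In general, a least squares problem can be written as

$$\mathbf{y} = \mathbf{A}\mathbf{x} \qquad (8)$$

where **y** is the (known) observation vector, **A** is the design matrix, and **x** is the vector of unknown parameters. The actual dimensions of **y**, **A**, and **x** depend on the number of data sets used and on the number of data set pairs which are (a priori) assumed to have correlated errors. This also determines the degree of redundancy in **A** **x**. As an example, for the case of four data sets, referred to as the quadruple collocation (QC) scenario with $i, j, k, l \in [a, b, c, d]$, with only $a$ and $b$ having correlated errors ($\sigma_{\epsilon_a\epsilon_b} \neq 0$), equation (8) takes the form of equation (9) [matrix equation missing in the source]. The least squares solution for **x** is then given as

$$\hat{\mathbf{x}} = \left(\mathbf{A}^T\mathbf{A}\right)^{-1}\mathbf{A}^T\mathbf{y} \qquad (10)$$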

Notice that the QC case (*N* = 4) with only one nonzero error cross-correlation was chosen merely as an example for demonstration purposes. Equation 9 can be easily extended to any number of *N* > 4 data sets, which allows also for the estimation of more than one nonzero error cross-correlation, for example, between multiple active satellite-based and multiple passive satellite-based soil moisture data sets. However, regardless of the number of data sets used in EC analysis, not every possible error structure is resolvable.
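One way such a QC system could be set up is sketched below. The layout of the observation vector and the design matrix is an assumption chosen for illustration and should not be read as the authors' exact formulation: the observations are the four data set variances plus TC-style covariance combinations, and the unknowns are four scaled signal variances, four error variances, and the error cross-covariance of a and b. The function name `qc_least_squares` and all numbers are hypothetical.

```python
import numpy as np

def qc_least_squares(cov):
    """Solve a QC system for x = [s_a, s_b, s_c, s_d, e_a, e_b, e_c, e_d, c_ab].

    cov is the 4x4 covariance matrix of (a, b, c, d); only a and b may have
    correlated errors. s = scaled signal variance, e = error variance.
    """
    a, b, c, d = range(4)
    # TC-style estimate of the scaled signal variance of i from partners j, k
    S = lambda i, j, k: cov[i, j] * cov[i, k] / cov[j, k]
    # Each observation: (value, columns of x that sum to it)
    obs = [(cov[i, i], [i, 4 + i]) for i in range(4)]        # var_i = s_i + e_i
    obs += [(S(a, c, d), [a]), (S(b, c, d), [b]),            # triplets without the pair (a, b)
            (S(c, a, d), [c]), (S(c, b, d), [c]),
            (S(d, a, c), [d]), (S(d, b, c), [d])]
    obs += [(cov[a, b] - cov[a, c] * cov[b, d] / cov[c, d], [8]),   # two estimators of c_ab
            (cov[a, b] - cov[a, d] * cov[b, c] / cov[c, d], [8])]
    y = np.array([value for value, _ in obs])
    A = np.zeros((len(obs), 9))
    for row, (_, cols) in enumerate(obs):
        A[row, cols] = 1.0
    x = np.linalg.solve(A.T @ A, A.T @ y)   # normal equations: (A^T A)^-1 A^T y
    return x, A

# Hypothetical population covariance matrix (no sampling noise)
beta = np.array([1.0, 0.9, 1.1, 0.8])
err = np.diag([36.0, 49.0, 25.0, 64.0])
err[0, 1] = err[1, 0] = 0.6 * 6.0 * 7.0     # error cross-covariance of a and b
cov = 100.0 * np.outer(beta, beta) + err
x_hat, A = qc_least_squares(cov)
```

Fed with an exact covariance matrix, this system returns the prescribed parameters. With sample covariances the redundant rows average out part of the estimation noise. In practice `np.linalg.lstsq` is a numerically more robust alternative to forming the normal equations explicitly.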

### 2.4 Resolvable Error Structures

In equation (6) we see that the consistency of the error cross-covariance estimator requires zero error cross covariance between some specific data set combinations, i.e., $\sigma_{\epsilon_i\epsilon_k} = \sigma_{\epsilon_j\epsilon_l} = \sigma_{\epsilon_k\epsilon_l} = 0$. The same holds for the signal and error variance estimators in equation (4), which require $\sigma_{\epsilon_i\epsilon_j}$, $\sigma_{\epsilon_i\epsilon_k}$, and $\sigma_{\epsilon_j\epsilon_k}$ to be zero. If any of these were allowed to be nonzero, the matrix $\mathbf{A}^T\mathbf{A}$ would become singular and the collocation system of equations in equation (10) could not be solved. However, regardless of the number of data sets used, we can define the requirement on the invertibility of the matrix $\mathbf{A}^T\mathbf{A}$ as follows: each member of the data set pairs with cross-correlated errors must also be a member of at least one data set triplet with mutually uncorrelated errors. For example, when using two passive satellite-based data sets (which are those assumed to have correlated errors) together with one active satellite-based and one modeled data set, we can define two triplets, each composed of the active satellite-based data set, the modeled data set, and one of the passive satellite-based data sets, both of which have fully independent error structures. In this case, $\mathbf{A}^T\mathbf{A}$ can be inverted, and the collocation system of equations can be solved. More generally speaking, $\mathbf{A}^T\mathbf{A}$ has to have full rank. Therefore, the rank of $\mathbf{A}^T\mathbf{A}$ (and thus also of **A**) has to be equal to the size of **x**.
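The triplet rule stated above lends itself to a simple check. The following sketch encodes only that rule. The function name `is_resolvable` and the data set labels are hypothetical, and for a given design matrix the rank condition remains the general criterion.

```python
from itertools import combinations

def is_resolvable(names, correlated):
    """Check that every member of a correlated pair belongs to at least one
    triplet whose three pairwise error correlations are all assumed zero."""
    corr = {frozenset(pair) for pair in correlated}

    def has_independent_triplet(member):
        others = [name for name in names if name != member]
        for j, k in combinations(others, 2):
            triplet_pairs = {frozenset((member, j)),
                             frozenset((member, k)),
                             frozenset((j, k))}
            if not (triplet_pairs & corr):   # no correlated pair inside the triplet
                return True
        return False

    return all(has_independent_triplet(m) for pair in correlated for m in pair)

passive_pair = [("amsre_c", "amsre_x")]
all_satellites = passive_pair + [("ascat", "amsre_c"), ("ascat", "amsre_x")]
four = ["amsre_c", "amsre_x", "ascat", "gldas"]
five = four + ["insitu"]

case_passive_only = is_resolvable(four, passive_pair)    # two passive products correlated
case_four_all = is_resolvable(four, all_satellites)      # all satellite pairs correlated
case_five_all = is_resolvable(five, all_satellites)      # same, with in situ data added
```

The three cases mirror the configurations used later in the real data experiment: four data sets suffice when only the two passive products are assumed correlated, whereas allowing correlations between all satellite products requires a fifth, independent data set.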

## 3 Demonstration

In the following sections we will evaluate the EC method using both synthetic identical twin experiments and a real data analysis. For simplicity and without any loss of generality we will limit the demonstration to scenarios where either four or five data sets are available.

### 3.1 Synthetic Experiment

For the synthetic experiment we limit the number of data sets (*N*) to *N* = 4 (i.e., to the QC scenario) with only one data set pair having cross-correlated errors. This represents the worst case (in the synthetic case) since the inclusion of more data sets would increase the degrees of freedom in the collocation system of equations, which would lead to an increased precision of the estimates.

A true soil moisture reference $\Theta$ is first generated via an unperturbed integration of the Antecedent Precipitation Index model ($\Theta_t = \gamma\Theta_{t-1} + P_t$, where $t$ is the time index, the loss variable $\gamma$ is held fixed at 0.85, and the precipitation $P$ is modeled as a Poisson process [*Crow et al.*, 2012a]). Four soil moisture data sets are then generated by artificially perturbing the soil moisture reference with random noise containing varying cross-correlations, drawn from a multivariate normal distribution.
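The data generation step might look as follows. This is a sketch under explicit assumptions: the rainfall generator (daily Poisson event counts with exponential depths), its parameters `rain_rate` and `mean_depth`, the function names, and the chosen error variances are all hypothetical stand-ins, since the source specifies only the recursion, the loss variable of 0.85, and the use of a multivariate normal distribution.

```python
import numpy as np

def api_truth(n_days, gamma=0.85, rain_rate=0.3, mean_depth=10.0, seed=0):
    """Integrate theta_t = gamma * theta_{t-1} + P_t with synthetic rainfall."""
    rng = np.random.default_rng(seed)
    # Assumed rainfall model: Poisson event counts per day, exponential depths [mm]
    rain = rng.poisson(rain_rate, n_days) * rng.exponential(mean_depth, n_days)
    theta = np.zeros(n_days)
    for t in range(1, n_days):
        theta[t] = gamma * theta[t - 1] + rain[t]
    return theta

def perturb(theta, error_cov, seed=1):
    """Add zero-mean multivariate normal errors; returns an (n_days, n_sets) array."""
    rng = np.random.default_rng(seed)
    noise = rng.multivariate_normal(np.zeros(len(error_cov)), error_cov, size=theta.size)
    return theta[:, None] + noise

# Hypothetical error structure: variances in mm^2, first two data sets correlated
err_var = np.array([120.0, 200.0, 280.0, 360.0])
rho = 0.5
error_cov = np.diag(err_var)
error_cov[0, 1] = error_cov[1, 0] = rho * np.sqrt(err_var[0] * err_var[1])

theta = api_truth(750)
data = perturb(theta, error_cov)
```

Each column of `data` is one synthetic data set. Looping over error variance and cross-correlation levels then produces the collection of cases analyzed in the experiment.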

Synthetic soil moisture quadruplets are generated for a large number of different cases. Error cross-correlation levels between two of the data sets are systematically varied between 0.0 [-] and 1.0 [-] in increments of 0.1 [-], and error variance levels are varied in all four data sets between 40 mm² and 600 mm² in increments of 80 mm². This corresponds to an SNR between about −6 dB and +6 dB, which is a typical range for soil moisture data sets [*Gruber et al.*, 2015]. Altogether, this requires the generation of 45,056 separate synthetic data sets. The sample size of each data set is 750 days, which is approximately the average sample size available for the real data experiment (see section 3.2). The EC-based error cross-correlation estimates for these 45,056 data sets, obtained using equation (10), are shown in Figure 1. True error cross-correlation levels can be recovered without bias and with negligible root-mean-square error (RMSE) (0.08 [-]), which decreases with increasing error cross-correlation. Therefore, the application of EC for accurately estimating error cross-correlations appears plausible. Note that the apparent increase in estimation accuracy with increasing error cross-correlation magnitude originates from the nonlinear (…)^{−1} dependency on error variance estimates when converting the error cross-covariance estimates to error cross-correlations. The uncertainties of the error cross-covariance estimates alone do not show such a dependence.
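The number of synthetic cases follows directly from the parameter grid: eleven cross-correlation levels combined with eight error variance levels for each of the four data sets. A short enumeration, with variable names chosen here for illustration, confirms the count:

```python
from itertools import product

correlations = [round(0.1 * i, 1) for i in range(11)]   # 0.0, 0.1, ..., 1.0 [-]
variances = list(range(40, 601, 80))                    # 40, 120, ..., 600 mm^2

# One case per combination of a correlation level and four error variance levels
cases = list(product(correlations, variances, variances, variances, variances))
n_cases = len(cases)   # 11 * 8**4 = 45056
```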

### 3.2 Real Data Experiment

In this section we further evaluate the EC method by applying it to real data. The soil moisture data sets used for this study are (i) passive satellite-based retrievals from the AMSR-E C-band channel, (ii) passive satellite-based retrievals from the AMSR-E X-band channel, (iii) active satellite-based retrievals from ASCAT, (iv) soil moisture estimates from the GLDAS-Noah land surface model, and (v) ground measurements from globally distributed in situ stations drawn from the International Soil Moisture Network.

While active-based, passive-based, modeled, and in situ soil moisture estimates are widely assumed to have mutually independent error structures, the two AMSR-E data sets from two different frequency channels are very likely to have significant nonzero error cross-correlation due to instrumental and algorithmic identity. Here we use EC analysis to estimate these supposed error cross-correlations between multifrequency AMSR-E retrievals and further test the assumption of zero error cross-correlation between AMSR-E and ASCAT retrievals.

Soil moisture estimates from AMSR-E are retrieved using the Land Parameter Retrieval Model (LPRM) version 5 [*Owe et al.*, 2008] and provided by the Vrije Universiteit Amsterdam. Data are provided in volumetric units on a regular grid with 0.25° grid spacing. Vegetation Optical Depth estimates are used to filter out retrievals with a high uncertainty due to dense vegetation [*Parinussa et al.*, 2011]. Usually, Radio Frequency Interference (RFI) estimates are used to switch from C- to X-band retrievals in RFI-contaminated areas [*Owe et al.*, 2008]. Here we consider both C- and X-band retrievals separately in order to estimate their mutual error cross-correlation. RFI estimates are used to mask out areas with high contamination in either of the frequency bands.

The active satellite-based soil moisture data set is the H-25 SM-OBS-4 MetOp-A ASCAT time series product, retrieved using the TU Wien algorithm version WARP 5.5 R2.2 [*Wagner et al.*, 1999; *Naeimi et al.*, 2009]. ASCAT operates at C-band. Retrieved soil moisture estimates are provided as degree of saturation at a spatial resolution of 25 km, regridded to a 12.5 km Discrete Global Grid. The WARP Surface State Flag [*Naeimi et al.*, 2012] is used to remove measurements taken under frozen or freezing/thawing conditions.

The Global Land Data Assimilation System (GLDAS)-Noah model provides soil moisture data for four different depth layers at a spatial resolution of approximately 0.25° and at a 3-hourly sampling rate [*Rodell et al.*, 2004]. Only the top layer (0–10 cm) is used in this study.

In situ data is drawn from the International Soil Moisture Network (ISMN), which is a data hosting facility that collects and harmonizes data from networks and field validation campaigns worldwide, and makes them available to the users on a centralized web platform [*Dorigo et al.*, 2011a, 2011b]. For this study we consider all stations that lie within the temporally overlapping period of ASCAT and AMSR-E, i.e., January 2007 to October 2011. Measurements from sensors which are placed deeper than 10 cm below the surface are excluded. The ISMN also flags suspicious measurements such as spikes or signal saturation as well as measurements taken under frozen conditions or exceeding physically meaningful value ranges, based on automated quality control procedures [*Dorigo et al.*, 2013]. Measurements flagged as suspicious are excluded in this study. Data sets that meet the above described requirements are provided by the networks: AMMA-CATCH [*Pellarin et al.*, 2009], ARM (http://www.arm.gov/), COSMOS [*Zreda et al.*, 2008], GTK, HOBE [*Bircher et al.*, 2012], ICN [*Hollinger and Isard*, 1994], MAQU [*Su et al.*, 2011], MOL-RAO (http://www.dwd.de/mol/), OZNET [*Smith et al.*, 2012], PBO-H_{2}O [*Larson et al.*, 2008], REMEDHUS (http://campus.usal.es/∼hidrus/), SASMAS [*Young et al.*, 2008], SCAN (http://www.wcc.nrcs.usda.gov/), SMOSMANIA [*Albergel et al.*, 2008], SNOTEL [*Leavesley et al.*, 2008], SWEX-POLAND [*Marczewski et al.*, 2010], UDC-SMOS [*Schlenz et al.*, 2012], UMBRIA [*Brocca et al.*, 2011], USCRN [*Bell et al.*, 2013], and USDA-ARS [*Jackson et al.*, 2010].

#### 3.2.1 EC Analysis Over the ISMN

As mentioned in section 2.4, EC requires at least two data sets whose errors are fully independent from the errors of all other data sets in addition to the data sets with assumed nonzero error cross-correlation. Therefore, both modeled and in situ data need to be included in the EC analysis when assuming nonzero error cross-correlations between ASCAT and AMSR-E. However, this results in spatially incomplete estimates due to the limited global coverage of available ground stations.

Figure 2 shows the error cross-correlation statistics between retrievals from the two AMSR-E channels, between ASCAT and AMSR-E C-band retrievals, and between ASCAT and AMSR-E X-band retrievals, respectively, for both absolute values (median: 0.82/0.27/0.25) and anomalies (median: 0.78/0.21/0.20) for all available stations. Anomalies were calculated by subtracting a 5 week moving-average window-based climatology. Figures 3 and 4 further show the spatial distribution of error cross-correlation over regions with a higher station coverage, i.e., the Contiguous United States, Europe, and New South Wales (Australia) for absolute measurements and anomalies, respectively. As expected, cross-correlations between the errors of the AMSR-E data sets are very high in almost all regions. A detailed discussion on AMSR-E error cross-correlation will be provided later in section 3.2.3. Against expectation, nonzero error cross-correlations exist—even though much lower—also between ASCAT and both AMSR-E frequency channels. These are slightly higher for absolute soil moisture retrievals than for anomalies and show some distinct spatial patterns: higher error cross-correlations over the western U.S., which are more pronounced for absolute values than for anomalies, higher cross-correlations between errors of absolute values over the Mississippi region, which are not present in the anomalies, and higher values over agricultural areas in Australia for both absolute values and anomalies.

Most of the observed nonzero error cross-correlations seem to be located in areas where in situ stations typically have a limited spatial representativeness, for instance, in the western U.S. where the topographic complexity is very high, or in the heavily irrigated Mississippi region. Therefore, the question arises whether the observed error cross-correlations in these regions are artificial biases due to limited representativeness of the ground measurements. In classical TC analysis, limited spatial representativeness causes a bias in the error variance estimates of the ground measurements, i.e., TC assigns them an additional representativeness error term [*Vogelzang and Stoffelen*, 2012; *Miralles et al.*, 2010; *Crow et al.*, 2012b; *Gruber et al.*, 2013, 2015]. The error variance estimates of the coarse-resolution data sets, on the other hand, remain unbiased. In the following section we investigate the impact of representativeness errors on error cross-correlation estimates in EC analysis analytically.

#### 3.2.2 Representativeness Errors in EC Analysis

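Following *Gruber et al.* [2015], we can split the observed soil moisture signal $\Theta$ into a joint signal component $\Theta_j$, which is observed by all data sets, and a coarse-scale component $\Theta_c$, which is observed by the coarse-resolution data sets only. Let us now consider four data sets $a$, $b$, $c$, and $d$, where $a$ represents a point-scale in situ data set and the others represent data sets with comparable coarse spatial resolution (such as, for instance, GLDAS-Noah, ASCAT, and AMSR-E), with the errors of data sets $c$ and $d$ being correlated. The covariances between the data sets can then be written as

$$\begin{aligned}
\sigma_{ab} &= \beta_a\beta_b\sigma_{\Theta_j}^2 \\
\sigma_{ac} &= \beta_a\beta_c\sigma_{\Theta_j}^2 \\
\sigma_{ad} &= \beta_a\beta_d\sigma_{\Theta_j}^2 \\
\sigma_{bc} &= \beta_b\beta_c\left(\sigma_{\Theta_j}^2 + \sigma_{\Theta_c}^2\right) \\
\sigma_{bd} &= \beta_b\beta_d\left(\sigma_{\Theta_j}^2 + \sigma_{\Theta_c}^2\right) \\
\sigma_{cd} &= \beta_c\beta_d\left(\sigma_{\Theta_j}^2 + \sigma_{\Theta_c}^2\right) + \sigma_{\epsilon_c\epsilon_d}
\end{aligned} \qquad (11)$$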

From equation (11) we can see that the error cross-covariance estimators in equation (6) remain unbiased. That is, even though nonzero error cross-correlations between ASCAT and AMSR-E are observed mainly in areas where in situ stations are expected to have limited representativeness, these representativeness errors should not induce biases in error cross-correlation estimates. Instead, the same phenomena that decrease the spatial representativeness of point measurements, i.e., highly localized soil moisture variations, might also induce correlations between retrieval errors of different satellites.
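This argument can be checked numerically with population covariances. The sketch below is an illustration under stated assumptions: the joint and coarse-scale signal components are taken to be mutually uncorrelated, the point-scale data set a observes the joint component only, and all variances and scaling factors are hypothetical numbers.

```python
import numpy as np

var_joint, var_coarse = 80.0, 20.0            # assumed signal component variances
beta = np.array([1.0, 0.9, 1.1, 0.8])         # scaling of data sets a, b, c, d
err = np.diag([30.0, 36.0, 49.0, 25.0])       # error variances
err[2, 3] = err[3, 2] = 0.7 * 7.0 * 5.0       # errors of c and d are correlated

# Rows: data sets a..d; columns: joint and coarse-scale signal components.
# Data set a (in situ) does not observe the coarse-scale component.
mix = np.column_stack([beta, beta * np.array([0.0, 1.0, 1.0, 1.0])])
signal_cov = np.diag([var_joint, var_coarse])  # components assumed uncorrelated
cov = mix @ signal_cov @ mix.T + err

a, b, c, d = range(4)
# Error cross-covariance of c and d, using a and b as references
cov_cd_est = cov[c, d] - cov[a, c] * cov[b, d] / cov[a, b]
# TC-style error variances of the in situ (a) and a coarse-resolution (b) data set
err_a_est = cov[a, a] - cov[a, b] * cov[a, c] / cov[b, c]
err_b_est = cov[b, b] - cov[a, b] * cov[b, c] / cov[a, c]
```

Under these assumptions the error cross-covariance of c and d and the error variance of the coarse-resolution data set b are recovered exactly, while the error variance assigned to the in situ data set is inflated by a representativeness term.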

#### 3.2.3 Global EC Analysis

In section 3.2.1, spatially limited in situ data were required to estimate error cross-correlations between the errors of ASCAT and AMSR-E. Here we exclude the in situ data from EC analysis in order to estimate error cross-correlations between the AMSR-E products globally. This, however, requires the assumption of zero error cross-correlation between ASCAT and AMSR-E. Even though we found that this assumption is not always fulfilled, observed nonzero error cross-correlations between ASCAT and AMSR-E are, in general, rather low (median ≈0.25) compared to those between the two AMSR-E products (median ≈0.8). Therefore, keeping a possible violation in mind, we will assume the cross-correlations between ASCAT and AMSR-E to be negligible.

Figure 5 shows the error cross-correlation estimates between the C- and X-band soil moisture retrievals from AMSR-E for both absolute values and anomalies. White shading indicates areas where estimates did not converge to a meaningful value (i.e., where the cross-correlation estimate was below −1.0 or above 1.0 [-]). As already observed in the in situ analysis, very high error cross-correlations exist in most regions. The 5, 25, 50, 75, and 95% quantiles are 0.42, 0.76, 0.87, 0.92, and 0.97 [-] for absolute values, and 0.16, 0.70, 0.82, 0.90, and 0.99 [-] for anomalies, respectively. As mentioned before, these error cross-correlation estimates might be biased due to the presence of nonzero error cross-correlations between ASCAT and AMSR-E. However, the average value ranges are comparable to those obtained in section 3.2.1, where globally distributed in situ measurements were included as a fifth data set in EC analysis so that the error cross-correlation estimates for the AMSR-E products remain unaffected by nonzero error cross-correlations between ASCAT and AMSR-E. This suggests that the possible biases in the AMSR-E C- and X-band error cross-correlation estimates from the global EC analysis presented in this section are largely negligible.

Clear spatial patterns exist, which suggests that the method is not overly sensitive to estimation noise. This is expected given the large number of temporally matching observations (median: 781). Likely drivers for these apparent error cross-correlation patterns are the differing spatial resolution and penetration depth of the two AMSR-E frequency channels, their differing sensitivity to vegetation, topographic complexity, and possibly also other land cover features, and, most importantly, radio frequency interference (RFI). Indeed, regions with low error cross-correlation show good agreement with regions where RFI is expected [*de Nijs et al.*, 2015]: C-band RFI contamination is expected mainly in the U.S., the Middle East, and Japan, whereas X-band RFI is expected mainly over England and Italy. RFI in both frequencies is also expected in Europe, especially around densely urbanized areas. Lower error cross-correlations are also observed in most of these regions. This good agreement is a first indicator for the reliability of EC error cross-correlation estimates. However, additional validation is required before the approach can be applied with full confidence.

## 4 Summary and Outlook

A method for estimating error cross-correlations between soil moisture data sets, referred to as extended collocation (EC) analysis, was developed by generalizing the well-known triple collocation (TC) analysis to an arbitrary number of data sets and relaxing the assumption of zero error cross-correlation for some data set combinations. The number of allowed nonzero error cross-correlations between data set pairs is mainly limited by the overall number of data sets used and by their underlying error cross-correlation structure: each member of the data set pairs with assumed nonzero error cross-correlation must also be a member of at least one data set triplet with fully independent errors. Furthermore, remaining degrees of freedom can be used to solve the collocation system of equations in a least squares sense.

The proposed EC method was evaluated using both a synthetic identical twin experiment and real data experiments. In the synthetic experiment, EC analysis was able to recover true error cross-correlation levels with an average root-mean-square error of 0.08 [-] and a negligible bias. In the real data experiments EC analysis was applied to satellite-based soil moisture retrievals from ASCAT, the AMSR-E C-band channel, the AMSR-E X-band channel, modeled soil moisture estimates from GLDAS-Noah, and in situ soil moisture measurements drawn from the International Soil Moisture Network. Results suggest that significant error cross-correlations exist between the AMSR-E C-band and X-band channels (median = 0.82 and 0.78 [-] for absolute values and anomalies, respectively), which are likely driven by their differing spatial resolution, sampling depth, sensitivity to vegetation and other land cover features, and—most importantly—RFI. Moreover, slight nonzero error cross-correlations were found also between ASCAT and AMSR-E (median = 0.25 and 0.20 [-] for absolute values and anomalies, respectively). These nonzero error cross-correlations may slightly bias the error cross-correlation estimates between the AMSR-E C- and X-band channels.

It should be emphasized that—even though only demonstrated for four and five data sets—the EC method presented in this study is readily applicable to an arbitrary number of data sets, which would facilitate the estimation of more nonzero error cross-covariance terms (e.g., when using three passive data sets such as SMAP, AMSR2, and SMOS together with two active data sets such as MetOp-A and MetOp-B). Therefore, it represents an important step toward a fully parameterized error covariance matrix which is vital for any rigorous data assimilation framework or data merging scheme.

## Acknowledgments

We thank Robert Parinussa and Alexandra Konings for valuable discussions. The used AMSR-E and GLDAS-Noah data are available from the NASA Goddard Space Flight Center (ftp://hydro1.sci.gsfc.nasa.gov/data/s4pa_TS2/WAOB/; ftp://hydro1.sci.gsfc.nasa.gov/data/s4pa_TS2/GLDAS_V1/). ASCAT data are available from EUMETSAT's H-SAF data portal (http://hsaf.meteoam.it/soil-moisture.php). In situ data used in this study are provided by the AMMA-CATCH, ARM, COSMOS, GTK, HOBE, ICN, MAQU, MOL-RAO, OZNET, PBO-H_{2}O, REMEDHUS, SASMAS, SCAN, SMOSMANIA, SNOTEL, SWEX-POLAND, UDC-SMOS, UMBRIA, USCRN, and USDA-ARS networks and are available from the ISMN data portal (http://ismn.geo.tuwien.ac.at/data-access/). This study was carried out within the eartH2Observe project (European Union's Seventh Framework Programme, grant 603608) and with support under The University of Melbourne's Early Career Researcher Grant Scheme.