Volume 117, Issue B2
Seismology

Transdimensional inversion of receiver functions and surface wave dispersion

T. Bodin
Research School of Earth Sciences, Australian National University, Canberra, ACT, Australia

M. Sambridge
Research School of Earth Sciences, Australian National University, Canberra, ACT, Australia

H. Tkalčić
Research School of Earth Sciences, Australian National University, Canberra, ACT, Australia

P. Arroucau
Environmental, Earth and Geospatial Sciences, North Carolina Central University, Durham, North Carolina, USA

K. Gallagher
Géosciences Rennes, Université de Rennes 1, Rennes, France

N. Rawlinson
Research School of Earth Sciences, Australian National University, Canberra, ACT, Australia
First published: 03 February 2012

Abstract

[1] We present a novel method for joint inversion of receiver functions and surface wave dispersion data, using a transdimensional Bayesian formulation. This class of algorithm treats the number of model parameters (e.g. the number of layers) as an unknown in the problem. The dimension of the model space is variable and a Markov chain Monte Carlo (McMC) scheme is used to provide a parsimonious solution that fully quantifies the degree of knowledge one has about seismic structure (i.e. constraints on the model, resolution, and trade-offs). The level of data noise (i.e. the covariance matrix of data errors) effectively controls the information recoverable from the data, and here it naturally determines the complexity of the model (i.e. the number of model parameters). However, it is often difficult to quantify the data noise appropriately, particularly in the case of seismic waveform inversion where data errors are correlated. Here we address the issue of noise estimation using an extended Hierarchical Bayesian formulation, which allows both the variance and covariance of data noise to be treated as unknowns in the inversion. In this way it is possible to let the data infer the appropriate level of data fit. In the context of joint inversions, assessment of uncertainty for different data types becomes crucial in the evaluation of the misfit function. We show that the Hierarchical Bayes procedure is a powerful tool in this situation, because it is able to evaluate the level of information brought by different data types in the misfit, thus removing the arbitrary choice of weighting factors. After illustrating the method with synthetic tests, a real data application is shown in which teleseismic receiver functions and ambient noise surface wave dispersion measurements from the WOMBAT array (South-East Australia) are jointly inverted to provide a probabilistic 1D model of shear-wave velocity beneath a given station.

Key Points

  • Novel scheme for joint inversion of receiver functions and surface wave dispersion
  • Transdimensional algorithm where the number of layers is an unknown
  • Bayesian formulation correctly accounts for data and model uncertainties

1. Introduction

[2] The coda of teleseismic P-waves contains a large number of direct and reverberated phases generated at interfaces beneath the receiver that carry a significant amount of information on seismic structure. However, these phases are difficult to identify as they are buried in micro-seismic and signal-generated noise [Lombardi, 2007]. The signal to noise ratio is usually improved by stacking seismograms from different records at a single station, but a major drawback is the introduction of different source time functions generated by multiple earthquakes. This problem is overcome by a method, developed in the 1970s and now widely used in seismology, that follows the pioneering work of Phinney [1964]. The idea is to deconvolve the vertical component from the horizontal components to produce a time series called a “receiver function” (RF) [Langston, 1979]. In a receiver function the influence of source and distant path effects is eliminated, and hence one can enhance conversions from P to S generated at boundaries beneath the recording site.

[3] The RF waveform can be inverted in the time domain for a 1D S-wave velocity model of the crust and uppermost mantle beneath the receiver. In this paper we present a novel RF inversion methodology where the number of layers defining the velocity model as well as the variance and correlation of data noise are treated as unknowns in the problem. We also show how independent data of different character and with different sensitivities (e.g. surface wave dispersion measurements) can be included in a consistent manner. The result is a general probabilistic joint inversion methodology, where no explicit “tuning” is needed to weight different data sets.

1.1. A Brief History of RF Inversion

[4] The RF inverse problem is highly non-linear and non-unique [Ammon et al., 1990]. Owens et al. [1984] carried out an iterative linearized inversion where partial derivatives were computed numerically with a finite difference scheme. The inversion was stabilized with truncation of small eigenvalues after singular value decomposition of the system of equations. Kosarev et al. [1993] and Kind et al. [1995] used a linearized Tikhonov inversion and stabilized the algorithm by penalizing solutions far from a given reference model. It is well known that linear inversion procedures based on partial derivatives are easily trapped by local minima, and hence solutions may be strongly dependent on initial models.

[5] As increased computational power became available, Monte Carlo parameter search methods became a practical alternative for RF inversion. These include global optimization techniques such as genetic algorithms [Shibutani et al., 1996; Levin and Park, 1997; Clitheroe et al., 2000; Chang et al., 2004], a niching genetic algorithm [Lawrence and Wiens, 2004], simulated annealing [Vinnik et al., 2004, 2006], very fast simulated annealing [Zhao et al., 1996], and the neighborhood algorithm [Sambridge, 1999a; Bannister et al., 2003; Nicholson et al., 2005]. These techniques are able to efficiently search a large multidimensional model space and provide complex Earth models that minimize a misfit measure without the need for linearization or computation of derivatives. They were applied in the hope that the solution would avoid entrapment in local minima of the objective function.

[6] The inherent non-uniqueness of RF inversion means that two Earth models that are far apart in the model space (i.e. which have different parameter values) can provide a similar level of data fit. Non-linear optimization algorithms, to varying degrees, are able to search a large parameter space and find global minima; however, they usually only provide a single solution, i.e. the best one in some sense. This leaves open the possibility that other Earth models, far from this solution, might also fit the data within errors. Hence a single solution is often not representative of the information contained in the data. To reduce dependence on single “best fit” models, global optimization techniques have been used to perform an ensemble inference, where one obtains an ensemble of models satisfying some predefined criteria (e.g. the 1000 best data-fitting models generated by the algorithm). This ensemble of “acceptable” models is then plotted together for visualization [e.g., Piana Agostinetti et al., 2002; Reading et al., 2003; Hetényi and Bus, 2007].

[7] However, the ensemble obtained in this way is rather arbitrary and there is no guarantee that these models are representative of all acceptable models. Furthermore, the statistical distribution of models within the ensemble generally does not represent the acceptable range in the objective function and therefore cannot be directly used to infer trade-offs, constraints, or resolution on model parameters. These issues arise because most non-linear optimization algorithms do not perform importance sampling (i.e. where the frequency distribution of sampled models is proportional to the objective function, or posterior distribution in a Bayesian framework), and hence the ensemble solution strongly depends on user choices, or on the class of algorithm employed.

[8] A typical example has been the use of the neighborhood algorithm [Sambridge, 1999a] for RF inversion [e.g., Piana Agostinetti et al., 2002; Reading et al., 2003; Bannister et al., 2003; Frederiksen et al., 2003; Hetényi and Bus, 2007]. In a second paper, Sambridge [1999b] invoked the Bayesian philosophy and showed how to calculate standard Bayesian outputs using an arbitrarily distributed ensemble, i.e. one generated by any ensemble technique. However, most studies which employ the neighborhood algorithm do so only in an optimization context.

[9] In a Bayesian framework the objective is to create an ensemble of models that represents the posterior probability distribution quantifying the degree of belief we have about the Earth's structure and composition. This probability distribution combines “a priori” knowledge with information contained in the observed data. Models most consistent with both data and prior information correspond to the maxima of this distribution. The tails are described by poorly fitting models in the ensemble, and the “width” or variance quantifies the constraints we have on model parameters, i.e. the uncertainty on the inferred solution. The covariance of the posterior distribution provides information on the correlation or trade-off between model parameters.

[10] The “Bayesian neighborhood algorithm” [Sambridge, 1999b] was used for RF inversion by Lucente et al. [2005] and Piana Agostinetti and Chiarabba [2008]. Subsequently, Piana Agostinetti and Malinverno [2010] expanded the Bayesian formulation to the case where the number of layers is not fixed in advance but is treated as an unknown in the problem. At first sight this may sound like an unrealistic prospect, as there would seem to be little to prevent an algorithm from introducing ever more detail into a model to improve data fit. However, in a transdimensional Bayesian formulation, high-dimensional models (i.e. those with many layers) are naturally discouraged [Malinverno, 2002]. This results from a convenient property of Bayesian inference referred to as “natural parsimony,” i.e. a preference for the least complex explanation for an observation. Overly complex models suffer from over-fitting and so have poor predictive power. Therefore, given a choice between a simple model with fewer unknowns and a more complex model that provides a similar fit to the data, the simpler one will be favored in Bayesian sampling (see MacKay [2003] for a discussion). The preference for models with fewer unknowns is an intrinsic feature of transdimensional sampling algorithms.

[11] Transdimensional inversion, i.e. where the dimension of the model space is an unknown, was first used in Earth Science by Malinverno [2002] for DC resistivity sounding. Since then, it has rapidly become popular, and has been introduced to a wide range of areas such as geostatistics [Stephenson et al., 2004], thermochronology [e.g., Stephenson et al., 2006], geochronology [Jasra et al., 2006], palaeoclimate inference [e.g., Hopcroft et al., 2007, 2009], inverse modeling of stratigraphy [Charvin et al., 2009a, 2009b], seismic tomography [Bodin and Sambridge, 2009], wire-line log data interpretation [Reading et al., 2010], change point modeling of geochemical records [Gallagher et al., 2011], geoacoustic inversion [Dettmer et al., 2010], potential field studies [Luo, 2010], and inversion of electromagnetic data [Minsley, 2011].

[12] Piana Agostinetti and Malinverno [2010] appears to be the first application of a transdimensional algorithm to the receiver function problem. In this paper, we extend their scheme to the hierarchical case where data noise levels are also treated as unknowns. Our scheme also solves a longstanding problem in geophysical inversion, i.e. how to determine the relative weights applied to different data types (e.g. receiver functions and dispersion measurements) during an inversion.

1.2. Receiver Function Variance

[13] As shown by Gouveia and Scales [1998], the level of data uncertainty estimated prior to inversion (i.e. the covariance matrix of data noise) plays a critical role in Bayesian inference. In an optimization framework the solution does not depend on the level of data noise (since the best fitting model remains the same when all data error bars are rescaled). In contrast, in a Bayesian framework, the data uncertainty directly determines the form of the posterior probability distribution and hence the posterior samples generated from it.

[14] In the context of a transdimensional inversion, the variance of data noise becomes even more important. Piana Agostinetti and Malinverno [2010] showed a clear relationship between the data errors and the number of interfaces in the sampled models. A transdimensional Bayesian procedure automatically adapts the complexity of the solution in order to fit the data up to the level of noise determined by the user. Of course, as more model parameters (e.g. more layers) are introduced, the data could be fitted better, but the procedure naturally prevents the data from being fitted beyond the given level of noise [for a recent example, see Dettmer et al., 2010].

[15] For receiver functions, the noise is correlated from sample to sample owing to the band-limited nature of the waveforms. The uncertainty can be divided into three types. Firstly, observational errors on the seismic waveform result from background seismic noise (micro-seisms) and from the instrumental noise affecting the recording. Often, outlier RFs in the stack are eliminated by visual inspection, and hence there is no clear quantification of the degree of observational noise. Secondly, processing errors occur in the deconvolution between components of the seismogram, which is an unstable operation. The frequency domain deconvolution is stabilized with a water-level scheme, whose parameter is chosen by trial-and-error [Clayton and Wiggins, 1976].

[16] In addition, there are assumptions made about the Earth (e.g. horizontal homogeneous isotropic layers) that prevent us from reproducing the observed RFs. We refer to the part of the data that cannot be modeled by our physical approximation of the Earth as “theory errors.” This type of noise is coherent and fully reproducible, and following the definition of Scales and Snieder [1998], it is a part of the signal we choose not to explain. For example, complex structures near the receiver produce a scattered wavefield that is not taken into account in our forward model and which is thus treated as data noise in the inversion. A Gaussian filter is applied to limit the final frequency band, in order to reduce the sensitivity to fine structure.

[17] As shown by Di Bona et al. [1998], all these contributions to the RF variance may not be simply additive and an overall quantification of the noise in terms of magnitude and correlation is often difficult.

1.3. The Covariance Matrix of Data Errors

[18] In most Bayesian studies, the data noise is assumed to be normally distributed, and represented by a covariance matrix of data errors Ce. Ammon [1992] estimated the noise level from the power-spectral density in a pre-signal time window (the segment which precedes the direct P-wave arrival). Sambridge [1999b] estimated a noise covariance from multiple realizations of correlated noise waveforms. Piana Agostinetti and Malinverno [2010] derived the noise correlation from the averaging function which is calculated by deconvolving the vertical component of motion from itself, using the chosen water-level parameter. If the water-level fraction is zero, the result is a perfect Gaussian (from the low-pass filter included in the procedure). Di Bona et al. [1998] evaluated the noise involved in the frequency domain deconvolution by using the residuals of a time domain deconvolution of the averaging functions from the computed RFs, in the portion preceding the P-pulse.

[19] These different schemes approximate different effects, and it is clear that there is no consensus to date on how to measure noise in RFs. Although these techniques can be used to infer the level of observational and processing noise, they do not estimate uncertainties due to theoretical errors. Note, however, that Gouveia and Scales [1998] accounted for model discretization errors in the case of Bayesian seismic waveform inversion. In the case of geoacoustic inversion, Dettmer et al. [2007, 2008] proposed estimating the covariance of data errors from analysis of data residuals obtained from a maximum likelihood solution.

[20] In this work we address the issue of noise estimation with a Hierarchical approach [Malinverno and Briggs, 2004; Malinverno and Parker, 2006]. The Hierarchical Bayes model is so named because it has two levels of inference. At the higher level are “hyper-parameters” such as the noise variances of the data. At the lower level are the physical parameters that represent Earth properties, e.g. seismic velocities. Information on physical parameters at the lower level is conditional on the values of hyper-parameters selected at the higher level. Overall, a joint posterior probability distribution is defined over both hyper-parameters and Earth parameters.

[21] Here we use a Hierarchical Bayes formulation where both the variances and correlation parameters of data noise are treated as unknowns in the inversion. In this way we fully take account of the complex combination of effects contributing to the misfit. In the context of a variable number of layers, we shall show that this can be a major advantage over having fixed noise estimates.

1.4. Joint Inversion With Surface Wave Dispersion Measurements

[22] Although RFs are particularly suited to constraining the depth of discontinuities, they are only sensitive to relative changes in S-wave velocities in different layers, and poorly constrain absolute values. Conversely, surface wave dispersion (SWD) measurements are sensitive to absolute S-wave velocities but cannot constrain sharp gradients, and are poor at locating interfaces [Juliá et al., 2000]. The difficulty of quantitatively utilizing data sets with different sensitivities has resulted in most models of shear-wave velocity being based only on either RFs or SWD.

[23] However, with increasing computational power, methods to jointly invert RF and SWD are gaining in popularity [e.g., Özalaybey et al., 1997; Du and Foulger, 1999; Juliá et al., 2000, 2003; Chang et al., 2004; Lawrence and Wiens, 2004; Yoo et al., 2007; Tkalčić et al., 2006; Moorkamp et al., 2010; Tokam et al., 2010; Salah et al., 2011]. The motivation for this approach is to improve resolution, and to reduce both the non-uniqueness of the problem and the influence of noise. Furthermore, if different types of data are inverted together, the complementary constraints are likely to better resolve structure.

[24] In this work we propose to invert RFs jointly with observations based on the cross-correlation of ambient noise recorded at nearby receivers [Campillo and Paul, 2003; Shapiro and Campillo, 2004; Stehly et al., 2009], which provides apparent travel times of surface waves at periods of ∼1–30 s (mostly sensitive to the crust). In the context of joint inversions, assessment of data uncertainty becomes crucial in the construction and evaluation of a misfit function. This is because data sets of different nature have different levels of noise, and their relative uncertainty determines their relative contribution to the misfit. Often, some arbitrary weighting factor is chosen, which raises the question of whether maximum benefit is being obtained from the joint inversion. In this paper we show that a Hierarchical Bayes procedure appears to be effective in this situation, as it is able to quantify the level of information brought by different data types in a self-consistent manner.

2. Methodology

2.1. Model Parameterization

[25] In this study the radial RF and SWD are assumed to be dominated by the response of homogeneous horizontal layers beneath the receiver. The geometry of layers is described by a variable number, k, of Voronoi nuclei as shown in Figure 1. The layer boundaries are defined as equidistant between adjacent nuclei, with the lowest layer a half space. Each layer i (with i ∈ [1, k]) is therefore determined by the depth of its nucleus ci and by a shear wave velocity value vi. By allowing the number of layers, k, to be variable as well as both the position of the nuclei, c, and velocities, v, we have a highly flexible parameterization of variable thickness layers (see Figure 1). In our transdimensional approach, this dynamic parametrization will adapt to the spatial variability in the velocity structure information provided by the data.

Figure 1. The model is parameterized with a variable number of Voronoi nuclei (red squares) which define the seismic structure (blue line). The vertical locations of the nuclei define the geometry of the layers, whose Vs values are given by the horizontal positions of the nuclei. Note that Voronoi nuclei are not necessarily at the centers of layers; rather, boundaries are defined as equidistant from adjacent nuclei.
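A minimal sketch of this parameterization (our illustration, not the authors' code): nuclei depths and velocities map to layer interfaces placed halfway between adjacent nuclei, with the deepest layer a half-space.

```python
import numpy as np

def voronoi_to_layers(c, v):
    """Map nuclei depths `c` and velocities `v` to interface depths:
    interfaces sit halfway between adjacent nuclei (sorted by depth),
    and the deepest layer extends as a half-space."""
    order = np.argsort(c)
    c, v = np.asarray(c)[order], np.asarray(v)[order]
    interfaces = 0.5 * (c[:-1] + c[1:])    # k - 1 interfaces for k nuclei
    return interfaces, v

# Example: nuclei at 5, 20 and 40 km give interfaces at 12.5 and 30 km
interfaces, vs = voronoi_to_layers([5.0, 20.0, 40.0], [3.0, 3.4, 4.2])
```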

[26] We also make inference on the covariance matrix of data noise Ce, which can be expressed with a number of hyper-parameters h = (h1, h2, …, hm), treated as unknowns in the inversion. Therefore, the complete model to be inverted for is defined as m = [c, v, k, h].

[27] As described by Lombardi [2007], the timing of RFs is relative to the first P arrival and is thus very sensitive to the variation of Vs relative to Vp. Note that RFs are also sensitive to crustal attenuation. Here, however, we fix the Vp/Vs ratio and the attenuation coefficients to a reference model and invert only for Vs in each layer. Furthermore, we use only the simplest possible representation of velocity within each layer, i.e. a constant, although higher order polynomials are possible, e.g. a linear gradient or quadratic.

[28] In our inversion study, inappropriate modeling assumptions (e.g. no dipping layers or anisotropy) may manifest themselves in a poor data fit. In a conventional Bayesian framework, these theory errors have to be taken into account in the data noise covariance matrix, which is often impractical (see Gouveia and Scales [1998] for details). For example, how would one quantify the magnitude and correlation of data noise generated by approximating a dipping layer as horizontal, or a complex anisotropic medium as isotropic? An advantage of the Hierarchical Bayes formulation is that we let the data infer their own degree of uncertainty, and hence theory errors, while still present, are allowed for in the estimation of the data noise and acceptable levels of data fit.

2.2. The Forward Calculation

[29] Our direct search algorithm requires solving the forward problem a large number of times, that is, computing the RF predicted by a given Earth model parameterized as described above. We use the Thomson-Haskell matrix method [Thomson, 1950; Haskell, 1953] to compute the spectral response of a stack of isotropic layers to an incident planar P-wave; multiple reflections are accounted for by this method. Since this way of solving the forward problem avoids slowness integration, it is fast and has been widely used in Monte Carlo algorithms [e.g., Shibutani et al., 1996; Sambridge, 1999a]. Once synthetic seismograms have been computed for the different components, receiver functions are made via frequency domain deconvolution of the vertical component from the radial component using water-level spectral division [Langston, 1979] with a water-level of 0.0001. In the case of a joint inversion, the forward method used to calculate surface wave dispersion is DISPER80, developed by Saito [1988]; this algorithm does not account for seismic attenuation. Since the proposed inversion algorithm is a direct parameter search, the forward calculations are separate routines independent of the main algorithm, and hence they can easily be replaced by alternative algorithms.
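As an illustration of the deconvolution step, a schematic water-level spectral division might look as follows. This is a sketch under our own assumptions (function names and the Gaussian filter width a are our choices), not the code used in the paper:

```python
import numpy as np

def water_level_deconvolution(radial, vertical, dt, water=1e-4, a=2.5):
    """Schematic water-level spectral division [Langston, 1979]: divide
    R(w) by Z(w) with the denominator floored at `water` times its
    maximum, then low-pass with a Gaussian filter exp(-w^2 / (4 a^2))."""
    n = len(radial)
    R = np.fft.rfft(radial, n)
    Z = np.fft.rfft(vertical, n)
    denom = (Z * np.conj(Z)).real
    denom = np.maximum(denom, water * denom.max())   # water-level floor
    w = 2.0 * np.pi * np.fft.rfftfreq(n, dt)
    gauss = np.exp(-(w ** 2) / (4.0 * a ** 2))       # Gaussian low-pass
    return np.fft.irfft(R * np.conj(Z) / denom * gauss, n)
```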

2.3. Bayesian Inference

[30] In a Bayesian approach all information is represented in probabilistic terms [Box and Tiao, 1973; Smith, 1991; Gelman et al., 2004]. Geophysical applications of Bayesian inference are described by Tarantola and Valette [1982], Duijndam [1988a, 1988b], and Mosegaard and Tarantola [1995]. The aim of Bayesian inference is to quantify the a posteriori probability distribution (or posterior distribution) which is the probability density of the model parameters, m, given the observed data, dobs, written as p(m|dobs) [Smith, 1991]. In a transdimensional formulation, the number of unknowns is not fixed in advance, and so the posterior is defined across spaces with different dimensions. This transdimensional probability distribution is taken as the complete solution of the inverse problem. In practice one tends to use computational methods to generate samples from the posterior distribution, i.e. an ensemble of vectors m whose density reflects that of the posterior distribution.

[31] Bayes' theorem [Bayes, 1763] is used to combine prior information on the model with the observed data to give the posterior probability density function:
$$p(\mathbf{m} \mid \mathbf{d}_{\text{obs}}) = \frac{p(\mathbf{d}_{\text{obs}} \mid \mathbf{m})\; p(\mathbf{m})}{p(\mathbf{d}_{\text{obs}})} \qquad (1)$$

$$p(\mathbf{m} \mid \mathbf{d}_{\text{obs}}) \propto p(\mathbf{d}_{\text{obs}} \mid \mathbf{m})\; p(\mathbf{m}) \qquad (2)$$

where x|y means x given, or conditional on, y, i.e. the probability of having x when y is fixed. Here m is the vector of model parameters and dobs is the vector of observed data. The term p(dobs|m) is the likelihood function, which is the probability of observing the measured data given a particular model. p(m) is the a priori probability density of m, that is, what we (think we) know about the model m before measuring the data dobs.

[32] Hence, the posterior distribution represents how our prior knowledge of the model parameters is updated by the data. Clearly, if the prior and the posterior distributions are the same, then the data add no new information.

[33] From an ensemble of models distributed according to the posterior, it is straightforward to determine special properties like the best or average model, or to construct marginal probability distributions for individual model parameters. Correlation coefficients between pairs of parameters can also be extracted [Gelman et al., 2004].

2.4. The Likelihood Function

[34] The likelihood function p(dobs|m) quantifies how well a given model with a particular set of parameter values can reproduce the observed data.

[35] The observed receiver function can be written as
$$d_{\text{obs}}(i) = g_i(\mathbf{m}) + \epsilon(i), \qquad i = 1, \ldots, n \qquad (3)$$
where n is the size of the data vector, and ϵ(i) represents errors that are distributed according to a multivariate normal distribution with zero mean and covariance Ce, which may be unknown. We recognize that the Gaussian assumption may itself be questionable in some cases. Furthermore, by assuming the normal distribution has zero mean, we do not account for systematic errors.
[36] In the case of correlated data noise, the fit to observations, Φ(m), is no longer defined as a simple “least-square” measure but is the Mahalanobis distance [Mahalanobis, 1936] between observed, dobs, and estimated, g(m), data vectors:
$$\Phi(\mathbf{m}) = \left(g(\mathbf{m}) - \mathbf{d}_{\text{obs}}\right)^{T}\, \mathbf{C}_e^{-1}\, \left(g(\mathbf{m}) - \mathbf{d}_{\text{obs}}\right) \qquad (4)$$
In contrast to the Euclidean distance, this measure takes into account the correlation between data points (it reduces to the Euclidean distance when Ce is the identity matrix). The general expression for the likelihood probability distribution is hence:
$$p(\mathbf{d}_{\text{obs}} \mid \mathbf{m}) = \frac{1}{\sqrt{(2\pi)^{n}\, |\mathbf{C}_e|}}\, \exp\!\left(-\frac{\Phi(\mathbf{m})}{2}\right) \qquad (5)$$
Note that this expression requires both the inverse Ce⁻¹ of the data noise covariance matrix and its determinant |Ce|.

[37] When treating the data noise as a variable, one might intuitively expect the algorithm to choose high values for the variance of data noise (i.e. the diagonal elements of Ce), because this would reduce the misfit in (4). However, the Gaussian likelihood function is normalized by |Ce| in (5), and here high data errors imply a low likelihood. Hence the magnitude of data noise has two competing effects on the likelihood.
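To make equations (4) and (5) concrete, here is a sketch (our illustration, not the authors' code) of the log-likelihood evaluated via a Cholesky factorization, which yields both the Mahalanobis distance and log|Ce| stably:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_likelihood(d_obs, d_pred, Ce):
    """Multivariate Gaussian log-likelihood of equations (4)-(5). The
    Cholesky factor of Ce yields both the Mahalanobis distance and
    log|Ce| without explicitly forming Ce^-1."""
    res = d_obs - d_pred
    L, low = cho_factor(Ce, lower=True)
    phi = res @ cho_solve((L, low), res)          # Mahalanobis misfit, (4)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))     # log |Ce|
    return -0.5 * (phi + logdet + len(res) * np.log(2.0 * np.pi))
```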

2.5. The Prior

[38] The Bayesian formulation enables one to account for prior knowledge, provided that this information can be expressed as a probability distribution p(m) [Gouveia and Scales, 1998]. All inferences from the data are then relative to this prior. In our 1D seismic inverse problem, this prior information is what we think is reasonable for the shear-wave velocity model we want to infer, according to previous studies.

[39] In a stimulating short essay, Scales and Snieder [1997] reviewed the philosophical arguments that have been invoked for and against Bayesian inversion. The principal criticism of Bayesian inversion is that users often “tune” the prior in order to get the solution they expect. In other words, the a priori knowledge of the model is often used as a control parameter to tune the properties of the final model produced. Therefore, one can easily argue that in a Bayesian framework the solution is influenced by the form of the prior distribution, whose choice is subjective.

[40] However, in the examples shown here, the final models will be dominated by the data rather than by prior information, and so we do not consider this to be a major limitation. This is because we assume unobtrusive prior knowledge by setting priors to uniform distributions with relatively wide bounds, although we acknowledge that uniform distributions are very informative about their bounds, and hence it is not possible to have a completely uninformative prior.

[41] The complete mathematical form of our prior distribution is detailed in Appendix A.

2.6. Transdimensional Inference

[42] Given the Bayesian formulation described above, our goal is to generate a collection, or ensemble, of Earth models distributed according to the posterior function. In our problem, we do not know the number of layers, i.e. the dimension of the model space is itself a variable, and hence the posterior becomes a transdimensional function. This can be sampled with a generalization of the well known Metropolis-Hastings algorithm [Metropolis et al., 1953; Hastings, 1970] termed the reversible-jump Markov chain Monte Carlo (rj-McMC) sampler [Geyer and Møller, 1994; Green, 1995, 2003], which allows inference on both model parameters and model dimensionality.

[43] A general review of transdimensional Markov chains is given by Sisson [2005], and Gallagher et al. [2009] present an overview of the general methodology and its application to Earth Science problems. The reversible jump algorithm is described in previous studies [e.g., Malinverno, 2002; Gallagher et al., 2011]. Here we only give a brief overview of the procedure, and present the mathematical details of our particular implementation in the appendices.

[44] The rj-McMC method is iterative: a sequence of models is generated in a chain, where typically each model is a perturbation of the last. The starting point of the chain is selected randomly, and the perturbations are governed by a proposal probability distribution which depends only on the current state of the model. The procedure for a given iteration can be described as follows: (1) Randomly perturb the current model, to produce a proposed model, according to some chosen proposal distribution (see Appendix B). (2) Randomly accept or reject the proposed model (in terms of replacing the current model), according to the acceptance criterion ratio (see Appendix C).
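The following self-contained sketch illustrates the structure of such an iteration for the layered model of section 2.1. It uses the common simplification in which birth steps draw the new nucleus from the uniform prior, so that prior and proposal terms cancel and the acceptance ratio reduces to the likelihood ratio; the flat stub likelihood makes the chain sample the prior and should be replaced by equations (4)-(5) in a real run. This is our illustration, not the authors' code; see Appendices B and C for the exact proposal distributions and acceptance ratios.

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform prior bounds (values follow section 3.1)
V_MIN, V_MAX = 2.0, 5.0            # S-wave velocity, km/s
Z_MAX, K_MIN, K_MAX = 60.0, 2, 50  # nucleus depth range and layer-count bounds

def log_likelihood(model):
    """Stub: a flat likelihood makes the chain sample the prior.
    Replace with equations (4)-(5) driven by the forward code."""
    return 0.0

def rjmcmc_step(model, ll):
    """One reversible-jump iteration on (nuclei depths c, velocities v).
    Birth draws the new nucleus from the prior, so the acceptance
    ratio reduces to the likelihood ratio (cf. Appendix C)."""
    c, v = model
    k = len(c)
    move = rng.choice(["move", "velocity", "birth", "death"])
    if move == "move":            # perturb one nucleus depth
        c2, v2 = c.copy(), v.copy()
        i = rng.integers(k)
        c2[i] += 1.0 * rng.standard_normal()
        if not 0.0 <= c2[i] <= Z_MAX:       # outside the prior: reject
            return model, ll
    elif move == "velocity":      # perturb one layer velocity
        c2, v2 = c.copy(), v.copy()
        i = rng.integers(k)
        v2[i] += 0.1 * rng.standard_normal()
        if not V_MIN <= v2[i] <= V_MAX:     # outside the prior: reject
            return model, ll
    elif move == "birth" and k < K_MAX:     # add a nucleus drawn from the prior
        c2 = np.append(c, rng.uniform(0.0, Z_MAX))
        v2 = np.append(v, rng.uniform(V_MIN, V_MAX))
    elif move == "death" and k > K_MIN:     # delete a randomly chosen nucleus
        i = rng.integers(k)
        c2, v2 = np.delete(c, i), np.delete(v, i)
    else:                         # dimension move rejected at the bounds
        return model, ll
    ll2 = log_likelihood((c2, v2))
    if np.log(rng.random()) < ll2 - ll:     # Metropolis accept/reject
        return (c2, v2), ll2
    return model, ll

# Run a short chain starting from a random 5-layer model
model = (rng.uniform(0.0, Z_MAX, 5), rng.uniform(V_MIN, V_MAX, 5))
ll = log_likelihood(model)
for _ in range(10000):
    model, ll = rjmcmc_step(model, ll)
```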

[45] The first part of the chain (called the burn-in period) is discarded, after which the random walk is assumed to be stationary and starts to produce a type of “importance sampling” of the model space. This means that models generated by the chain are asymptotically distributed according to the posterior probability distribution (for a detailed proof, see Green [1995, 2003]). If the algorithm is run long enough, these samples should then provide a good approximation of the posterior distribution for the model parameters, i.e. p(m|dobs).

[46] This “ensemble solution” contains many models with variable parameterization, and inference can be carried out with ensemble averages over the structure [see Piana Agostinetti and Malinverno, 2010]. For example, the posterior probability of the shear wave velocity at a given depth can be visualized simply by plotting the histogram of the values selected for the ensemble solution.

[47] In terms of choosing a single model for interpretation, we can consider the average over the ensemble of sampled models. This is known as the expected model, and is in fact a weighted average, in which the weighting is through the posterior distribution (sampled by the rj-McMC algorithm). All the models sampled have a particular parameterization defined by the number and position of their interfaces. When a large number of models are averaged, the positions of well defined interfaces will tend to overlap while less well defined ones will tend to cancel out. This spatial average model is effectively a continuous line which will capture the well resolved parts of the model [Bodin and Sambridge, 2009].
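For example, a sketch of this ensemble averaging (assuming each sampled model is stored as a pair of sorted interface depths and layer velocities; the storage format is our choice):

```python
import numpy as np

def profile(interfaces, v, depths):
    """Evaluate one layered model at each depth: `interfaces` are sorted
    interface depths (k - 1 of them), `v` the k layer velocities, with
    the deepest layer a half-space."""
    return np.asarray(v)[np.searchsorted(interfaces, depths)]

def expected_model(ensemble, depths):
    """Posterior mean Vs(z): evaluate every sampled model on a common
    depth grid and average (a histogram of the same values gives the
    marginal posterior shown in the density plots)."""
    profiles = np.array([profile(b, v, depths) for b, v in ensemble])
    return profiles.mean(axis=0), profiles.std(axis=0)

# ensemble = [(interfaces_1, v_1), (interfaces_2, v_2), ...]
# mean_vs, sd_vs = expected_model(ensemble, np.linspace(0.0, 60.0, 601))
```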

[48] Here there is no need for statistical tests or regularization procedures to choose the adequate model complexity or smoothness corresponding to a given degree of data uncertainty. Instead, the reversible jump technique automatically adjusts the underlying parametrization of the model to produce solutions with an appropriate level of complexity to fit the data to statistically meaningful levels.

2.7. Data Uncertainty Quantification: Hierarchical Bayes

[49] The covariance matrix of data noise Ce in (4) imposed at the outset has a direct effect on the solution of the reversible jump algorithm and implicitly acts as a smoothing parameter. This can be seen as an advantage over optimization based schemes where the level of smoothing is chosen a priori or interactively. However, in seismology, assessment of measurement errors can be difficult to achieve a priori. Without any reliable information about the data uncertainty, it is impossible to give a preference between two solutions obtained with different values of Ce.

[50] Fortunately, an expanded Bayesian formulation can take into account the lack of knowledge we have about data errors. Following current statistical terminology, the data noise covariance matrix is expressed with a number of hyper-parameters (i.e. Ce = f(h1, h2, …)), and the method used is known as Hierarchical Bayes [Gelman et al., 2004]. The model to be inverted for is defined by the combined set m = [c, v, k, h], where c and v are the vectors containing the nuclei locations and velocity values, and h = (h1, h2, …) is a vector of hyper-parameters defining the unknown data errors. The Hierarchical algorithm is implemented in the same manner as the conventional reversible jump, the only difference being that here we add an extra type of model perturbation, i.e. a change in the hyper-parameter vector h. As for other model parameters, each time h is perturbed, a new value is randomly proposed from a given distribution, and the new value of data noise is either accepted or rejected according to the acceptance criterion ratio (see Appendix C).

2.8. Parameterizing the Covariance Matrix of Data Noise

[51] As explained before, our philosophy is to consider the level of data noise as an unknown in the inversion. Therefore the main issue here is to “parameterize” the noise covariance matrix Ce, i.e. to express it as a function of a given number of hyper-parameters. This is a symmetric n × n matrix defined by (n² + n)/2 values, which are obviously impossible to estimate separately from only n data, and hence some assumptions need to be made. The noise covariance can be written in terms of a matrix of correlation R and a vector of standard deviations s:
$$\mathbf{C}_e = \operatorname{diag}(\mathbf{s})\; \mathbf{R}\; \operatorname{diag}(\mathbf{s}) \qquad (6)$$
With this decomposition, one can separate two properties of the noise, i.e. its magnitude and correlation [Piana Agostinetti and Malinverno, 2010]. For simplicity, in this study the noise is considered stationary, i.e. its magnitude and correlation are constant along the time series (although we acknowledge that this might not always be the case in RFs), and then Ce can be written:
$$\mathbf{C}_e = \sigma^2\, \mathbf{R} \qquad (7)$$
where σ² is the constant noise variance, i.e. the magnitude of data noise. (In the case of a non-stationary time series, σ can be parameterized as a linear function of time, σ(t) = h1 t + h2.) R is a symmetric diagonal-constant, or Toeplitz, matrix:
$$\mathbf{R} = \begin{bmatrix}
1 & c_1 & c_2 & \cdots & c_{n-1} \\
c_1 & 1 & c_1 & \cdots & c_{n-2} \\
c_2 & c_1 & 1 & \cdots & c_{n-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
c_{n-1} & c_{n-2} & c_{n-3} & \cdots & 1
\end{bmatrix} \qquad (8)$$
where ci (i = 1, …, n − 1) describes the noise correlation between points of the series. Thus c1 defines the correlation between two adjacent points, and more generally ci is the noise correlation between points that are i samples apart in the series. The key properties that we need are that the correlation function decreases with distance, with limiting values of 1 at i = 0 and 0 as i → ∞. This is the most common kind of association found in time series. The main question, then, is how to parameterize the correlation function ci, and with how many unknowns? Below we present two types of parameterization for the noise correlation.
[52] We first propose a parameterization which is convenient to implement for our particular problem. The correlation function is simply assumed to decay exponentially and is thus given by
$$c_i = r^{\,i} \qquad (9)$$
where r = c1 is a constant number between 0 and 1. The major advantage of such a parameterization is that the inverse and determinant of Ce needed in the likelihood in (5) have simple analytical forms, i.e. they can be expressed in terms of our two hyper-parameters h = [σ,r] describing the magnitude and correlation of data noise (see Appendix D).
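As an aside, one reason the exponential law is convenient: for Ce = σ²R with c_i = r^i (a first-order autoregressive covariance), Ce⁻¹ is tridiagonal and |Ce| has a closed form. The sketch below states the standard AR(1) result and checks it against dense linear algebra; it is our illustration, not necessarily the exact expressions of Appendix D:

```python
import numpy as np

def expcov_inverse_logdet(n, sigma, r):
    """Closed forms for Ce = sigma^2 * R with R[i, j] = r**|i - j| (AR(1)):
    Ce^-1 is tridiagonal and |Ce| = sigma^(2n) * (1 - r^2)^(n - 1)."""
    d = 1.0 / (sigma ** 2 * (1.0 - r ** 2))
    inv = np.zeros((n, n))
    np.fill_diagonal(inv, (1.0 + r ** 2) * d)
    inv[0, 0] = inv[-1, -1] = d                  # corner entries
    idx = np.arange(n - 1)
    inv[idx, idx + 1] = inv[idx + 1, idx] = -r * d
    logdet = 2 * n * np.log(sigma) + (n - 1) * np.log(1.0 - r ** 2)
    return inv, logdet

# Check against dense linear algebra
n, sigma, r = 6, 0.025, 0.85
Ce = sigma ** 2 * r ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
inv, logdet = expcov_inverse_logdet(n, sigma, r)
assert np.allclose(inv @ Ce, np.eye(n))
assert np.isclose(logdet, np.linalg.slogdet(Ce)[1])
```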
[53] A second type of parameterization that is commonly used to model the noise in RFs is a Gaussian correlation law:
$$c_i = r^{\,i^2} \qquad (10)$$
Compared to the first type of correlation, here there are no high frequency components, and hence this form of noise is clearly closer to what is observed in receiver functions before the first P-arrival (see Appendix D). This is because a Gaussian filter is used in the deconvolution process to remove high frequency noise that has high amplitude and which blurs the signal. Although this type of noise parameterization appears more realistic in the case of RFs, there are no stable analytical formulations for the inverse and determinant of Ce. Therefore Ce⁻¹ and |Ce| have to be computed numerically, which is computationally expensive and cannot be carried out each time Ce is perturbed along the random walk. Hence, in this case one can only invert for the magnitude of noise h = [σ], whereas the correlation r needs to be chosen by the user.

2.9. Adding Surface Wave Dispersion Data Into the Problem

[54] Given the framework described above, it is straightforward to invert jointly independent data types with different units and levels of noise. This is done simply by defining the data vector d and covariance matrix of data errors Ce in (5) as a concatenation of the data vectors and noise covariance matrices. In the case of a joint inversion of RF and SWD, this yields
$$\mathbf{d} = \begin{bmatrix} \mathbf{d}_{RF} \\ \mathbf{d}_{SWD} \end{bmatrix} \qquad (11)$$

$$\mathbf{C}_e = \begin{bmatrix} \mathbf{C}_e^{RF} & \mathbf{0} \\ \mathbf{0} & \mathbf{C}_e^{SWD} \end{bmatrix} \qquad (12)$$
where CeRF is constructed with two parameters σRF and rRF. For surface wave dispersion data, we assume the measurement error is constant with period, and CeSWD is likewise constructed with σSWD and rSWD (note that instead of being constant, σSWD could also be parameterized as a linear function of period).

[55] In this way, Ce can be parameterized with different noise parameters for each data type. As can be seen in (4), the level of noise in different data types controls their contribution in the misfit function, and hence their influence in the solution. Therefore, by inverting for data noise, we let the data themselves infer the contribution of each data type in the misfit, without having to define a scale factor to weight independent data sets.
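A sketch of the resulting joint likelihood (reusing the hypothetical log_likelihood from the sketch in section 2.4; forward_rf and forward_swd stand in for the forward codes of section 2.2 and are our placeholders):

```python
import numpy as np

def build_cov(n, sigma, r):
    """Exponential-correlation covariance of equations (7)-(9): c_i = r**i."""
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return sigma ** 2 * r ** lags

def joint_log_likelihood(m, d_rf, d_swd, forward_rf, forward_swd):
    """Joint likelihood for the block-diagonal Ce of (12): the independent
    RF and SWD terms simply add in log space, each weighted by its own
    noise hyper-parameters, so no manual weighting factor is needed."""
    Ce_rf = build_cov(len(d_rf), m["sigma_rf"], m["r_rf"])
    Ce_swd = m["sigma_swd"] ** 2 * np.eye(len(d_swd))  # uncorrelated SWD noise
    return (log_likelihood(d_rf, forward_rf(m), Ce_rf)
            + log_likelihood(d_swd, forward_swd(m), Ce_swd))
```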

[56] An alternative to this joint Bayesian inversion would be a two-step inversion where the posterior distribution obtained after inverting a first data set d1 would determine the prior for a second inversion based on the second data set d2. In the case where the two data sets are independent, we can write p(d1,d2|m) = p(d1|m)p(d2|m), and from (2) the posterior solution after the two step inversion would be identical to the posterior solution for the joint inversion.

3. Inversion of Synthetic Data

[57] We first test our algorithm with synthetic data computed from a known velocity model made of 6 horizontal layers. The true model (red line in Figure 3) presents two major features often targeted by RF studies: a low S-wave velocity layer in the crust (between 10 and 20 km) and a strong velocity increase at the Moho (at about 30 km depth). A synthetic receiver function (red line in Figure 2) is calculated from the true model with the forward method described in section 2.2, and correlated random noise is added, which results in the “observed” receiver function (blue line in Figure 2). An alternative approach would be to add noise to the seismic waveforms before the deconvolution. However, here the receiver function waveform is the data vector to be inverted. By directly adding the correlated noise to this vector, we know the exact form of the data noise, and can verify that the proposed algorithm is able to recover it. The synthetic noise is generated according to a covariance matrix Ce defined with the first type of correlation (i.e. ci = r^i) with values σtrue = 2.5 × 10⁻² and rtrue = 0.85, and hence the inversions carried out in this section assume an exponential correlation law. We recognize that this type of noise contains high frequency signals that are normally filtered out in real data. Therefore the results presented in this section should only be seen as a “proof-of-concept,” and not as a statement of the optimal noise parameterization.

Figure 2. Simulated receiver function. Red: synthetic data computed from the true model (red line in Figure 3). Blue: RF with added Gaussian random noise generated with an exponential correlation law (i.e. ci = r^i).
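The correlated noise used here can be generated by coloring white Gaussian noise with a Cholesky factor of Ce; a sketch with the values quoted above (the series length n is our choice):

```python
import numpy as np

rng = np.random.default_rng(42)
n, sigma, r = 500, 2.5e-2, 0.85              # values used in this section

lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
Ce = sigma ** 2 * r ** lags                  # exponential correlation, c_i = r**i
noise = np.linalg.cholesky(Ce) @ rng.standard_normal(n)   # N(0, Ce) sample
# d_obs = d_true + noise
```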

[58] In order to illustrate the different features of the algorithm, we first present results for a conventional transdimensional RF inversion with fixed noise estimates. Then, we extend the formulation to Hierarchical models, i.e. treat the data noise parameters (σ and r) as variables in the inversion. Finally, surface wave dispersion measurements with unknown errors are added into the problem for a joint inversion.

3.1. Transdimensional Inversion of RF With Fixed Noise Parameters

[59] The purpose of this section is to show that in a standard transdimensional inversion, i.e. where the estimated data noise is fixed to some values given by the user prior to the inversion, the form of the solution strongly depends on the choice of the covariance matrix of data errors. Hence the situation in this section is similar to that considered by Piana Agostinetti and Malinverno [2010]. The RF shown in Figure 2 has been inverted twice with a different matrix Ce at each run. The inversion is first carried out with the correct noise estimates (Figures 3a3c), and then using incorrect values for σ and r (Figures 3d3f). In the second case, σ has been underestimated by 40% (we use 0.015 instead of 0.025), and r has been overestimated by 8% (we use 0.92 instead of 0.85). Hence in this test we observe the effects of misestimating the data noise.

Figure 3. Transdimensional inversion of the synthetic RF (blue line in Figure 2). Here the noise parameters σ and r are kept fixed during the inversion at predefined values. (a–c) Results when the noise estimates are set equal to the values used to construct the synthetic noise. (d–f) Results when σ and r are respectively under- and over-estimated relative to their “true” values; in this case the posterior approximation of the true model (red) is clearly worsened. Figures 3a and 3d show the posterior probability distribution for Vs at each depth (red shows high probabilities and blue low probabilities); the true velocity model is plotted as a red line. Figures 3b and 3e show the posterior probability for the position of discontinuities; red lines show the depths of interfaces in the true model. Figures 3c and 3f show the posterior probability on the number of cells; the red line shows the number of cells in the true model.

[60] For the two cases, the transdimensional sampling was carried out allowing between 2 and 50 interfaces (nmin = 2 and nmax = 50). Bounds for the uniform prior distribution were set to 2–5 km/s for S-wave velocity values. Posterior inference was made using an ensemble of 10⁵ models. The algorithm was implemented on 200 parallel CPUs to allow a large number of independent chains, starting at different random points and sampling the model space simultaneously and independently. Each chain was run for 2 × 10⁵ steps in total. The first half was discarded as burn-in, after which the sampling algorithm was judged to have converged. To eliminate dependent samples in the ensemble solution, every 200th model visited in the second half was selected for the ensemble. The convergence of the algorithm is monitored with a number of indicators such as acceptance rates, and sampling efficiency is optimized by adjusting the variance of the Gaussian proposal functions (see Appendix B). (For details on convergence and independence of sampled models in Markov chains, see MacKay [2003].)
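A minimal sketch of this post-processing (assuming `chain` holds the list of models sampled by one CPU; the variable name is ours):

```python
def burn_and_thin(chain, keep_every=200):
    """Discard the first half of a chain as burn-in, then keep every
    `keep_every`-th model to reduce correlation between samples."""
    return chain[len(chain) // 2::keep_every]
```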

[61] The solution is given by the transdimensional posterior distribution which is represented by an ensemble of 1D models with variable number of layers and thicknesses. In order to visualize the final ensemble, the collected models can be projected into a number of physical spaces that are used for interpretation. For example, Figures 3a and 3d show the marginal distribution for S-wave velocities as a function of depth. At each depth, local information about the velocity model is represented by a complete distribution which can be seen as a marginal distribution of the posterior in this “interpretation space.” These marginal posteriors are shown as a color density map in Figure 3. In practice, the marginal posterior is simply constructed from the density plot (i.e. the histogram) of the ensemble of models in the solution. This density plot is convenient to visualize the ensemble solution, and it is particularly useful to demonstrate the constraints on Vs.

[62] If one is interested in assessing the resolution (number and position) of seismic discontinuities beneath the seismic station, it is possible to examine the ensemble solution from a different point of view and to plot the marginal posterior distribution on the location of interfaces. Figures 3b and 3e also show histograms of interface depths in the ensemble of models. For each depth, this function represents the probability density of having a discontinuity, given the data. This provides useful information on the inferred locations of transitions, which can be unclear in other plots. Note that the positions of interfaces are not direct model parameters, and hence this marginal distribution is again constructed by projecting the ensemble of sampled earth models into a visualization space.

[63] Since the models in the ensemble solution have a varying number of cells, the complexity of the solution cannot be described with a single number k. However, in Figures 3c and 3f we plot the histogram of k across the ensemble solution, which is directly proportional to the marginal posterior p(k|dobs). The number of layers in the true model is shown by the red line, and the uniform prior distribution on k, p(k), is shown in light blue.

[64] Clearly, the solution obtained with correct noise estimates (Figures 3a–3c) gives better results than when the noise is misestimated (Figures 3d–3f). Since the magnitude of noise has been underestimated, the algorithm automatically adds more layers than necessary and “overfits” the observed RF by fitting the unattributed noise. The expected number of layers in the model in Figure 3f is 15, which is twice the true value. Figure 3e shows that the locations of transitions are not as well recovered. However, even though features are generally degraded in comparison with Figure 3a, the most resolvable elements in Figure 3d (i.e. the locations of shallow discontinuities) are recovered. We verified in a separate experiment (not shown here) that if the data are assumed to be noisier than they really are (i.e. σ in Ce is overestimated), the inversion tends to produce a model with too few cells (i.e. the model is too simple).

[65] Therefore, the number of model parameters used, and hence the complexity of the solution, clearly depends on the estimated data noise imposed at the outset. The reversible jump technique automatically adjusts the underlying parametrization of the model to produce an average solution with just enough complexity to fit the data. This is a potential advantage over optimization based inversions. However, as seen before, assessment of measurement errors in RFs (and also in SWD) can be difficult a priori. Without any reliable information about the data uncertainty, it is impossible to give a preference between the two solutions in Figure 3 obtained with different values of Ce.

3.2. Hierarchical Bayes Inversion of RF

[66] In this section we consider the situation where little is known about data noise. We repeat the experiment of section 3.1 with exactly the same data vector (Figure 2) but instead treat the noise parameters σ and r as unknowns. Interestingly, Figure 4 shows results for this test which are virtually identical to those obtained when noise parameters are fixed to their correct values (as in Figures 3a and 3b).

Figure 4. Hierarchical Bayes inversion of the synthetic RF (blue line in Figure 2). Here the noise parameters σ and r are treated as unknowns in the inversion. (a) Black lines show a random subset of 60 models from the ensemble solution, which contains 10⁵ models and is fully represented with (b) a color density plot. The true velocity model in Figure 4a is plotted as a red line. The blue line shows the posterior mean model (or average solution), constructed by taking the average Vs at each depth across the ensemble solution. The green line shows the maximum of the marginal posterior model (or maximum solution), which follows the maximum of the distribution on Vs with depth. (c) Posterior probability for the position of discontinuities. Red lines show the depths of interfaces in the true model and correspond well to the peaks in the distribution.

[67] In Figure 4a, we show a random sample of 60 models in the ensemble solution. This is a very small subset of the ensemble solution and cannot be used to infer statistical properties, although it can be useful to visualize the solution. Note that some of these profiles are far from the true model in red and may not fit the data well, and reflect models from low probability regions of the posterior distribution. We show how particular 1D models for interpretation can be constructed from the ensemble solution. Firstly, we plot (in blue) the posterior mean model, simply constructed by taking the average Vs at each depth across the ensemble solution. We call this model the “average solution.” When models with different transition locations are added, the sharp changes present in individual models are smoothed out, while those at similar locations are reinforced. In this way, the average solution can exhibit at the same time sharp discontinuities and low gradients (blue line in Figure 4b). Instead of being predefined in advance by a single regularization parameter, the level of smoothness in the average solution is variable with depth and directly inferred by the data.

[68] A second model that can be constructed is the mode of the marginal posterior, which follows the maximum of the marginal distribution with depth (shown in green in Figure 4a). We call this model the “maximum solution.” Note that these two models are merely properties of an ensemble of models that have variable parameterizations, and hence do not correspond to any particular individual member of the ensemble. Note also that the maximum solution model is different from the best fitting model in the ensemble, and from the model that maximizes the posterior distribution overall. By inspection it is clear that these models provide a good estimate of the true model in red. Furthermore, all five transitions present in the true model are well recovered in Figure 4c.

[69] Figure 5 shows posterior inference on the number of layers and noise hyper-parameters, together with prior distributions. The number of layers in the true model, as well as the true values σtrue and rtrue used to generate the synthetic noise, are shown in red. With scant information on data errors, and on the complexity of the true model prior to the inversion, the Hierarchical Bayes procedure has been able to infer the magnitude and correlation of data noise, which quantified the required level of data fit, and thus the number of model parameters needed in the inversion. Therefore this example demonstrates that, by allowing the user to formulate the full state of uncertainty about data noise, a Hierarchical Bayesian procedure enables one to correct for a lack of knowledge about data noise.

Figure 5. Hierarchical Bayes inversion. (top) Posterior distribution for the number of layers, (bottom left) the standard deviation of data noise σ, and (bottom right) the correlation of data noise r. The uniform prior is shown in light blue, and the true values (used to construct the observed RF in blue in Figure 2) are shown in red.

[70] The RF inverse problem is highly non-linear, and hence the posterior is far from being a unimodal Gaussian distribution. To illustrate this, we have plotted in Figure 6 the marginal distribution on Vs at 31 km depth. This cross-section corresponds to the dashed line in Figure 4b, and it is close to the “Moho transition” in the true model. As a result, the marginal distribution is influenced by velocity values in the two layers on each side of the Moho, and it has two maxima about these two values. In this case, one can see that the average solution in blue is not representative of the true model, whereas the maximum solution in green is closer to the true velocity below the interface. Although the average solution model is smooth, the maximum solution model (jumping from one value to the other) is better at showing sharp transitions.

Figure 6. Marginal posterior for Vs at 31 km depth (i.e. slightly below the Moho discontinuity). The distribution is clearly influenced by the Vs values taken above and below the discontinuity. The blue line shows the mean value (average solution), the green line the maximum of the marginal (maximum solution), and the red line the true value (i.e. the Vs value in the first layer of the mantle in the true model).

[71] Finally, we give an example of trade-off assessment between two model parameters, that is, Moho depth versus S-wave velocity in the last layer of the crust. Again, here the depth of an interface is not strictly a model parameter but a useful feature that can be picked in any sampled model. The crust-mantle transition is defined in the Voronoi models as the discontinuity closest to 30 km. (We acknowledge that this definition of the Moho is somewhat arbitrary; however, the main purpose here is to illustrate trade-off assessment between selected seismic properties.) Figure 7 shows the 2D marginal posterior for the selected pair of parameters, which is obtained from the 2D histogram over the ensemble of models. In this way one can extract accurate and quantifiable information from the ensemble about the constraints and correlation for these parameters. This trade-off means, unsurprisingly, that the data are fit equally well when the Moho is deeper and Vs is higher, or vice versa; reassuringly, this limitation of the resolving power of the data is clearly evident in the analysis.

Figure 7. Posterior 2D marginal for the parameters representing the depth of the Moho and Vs in the last layer of the crust. (The Moho is defined as the closest interface to 30 km in the Voronoi models.) White dashed lines show the true values of both parameters.
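A minimal sketch of this construction, using synthetic stand-in samples with a built-in Moho-depth/Vs correlation (all values are hypothetical, not taken from the actual ensemble):

    import numpy as np

    rng = np.random.default_rng(1)
    # Hypothetical per-model summaries picked from each sampled model:
    # Vs in the lowermost crustal layer, and the discontinuity closest to 30 km.
    vs_crust = rng.normal(3.7, 0.1, 50_000)
    moho_depth = 30.0 + 25.0 * (vs_crust - 3.7) + rng.normal(0.0, 0.5, 50_000)

    # The 2D marginal posterior is the density-normalized 2D histogram.
    pdf, vs_edges, z_edges = np.histogram2d(vs_crust, moho_depth,
                                            bins=80, density=True)

    # The sample correlation quantifies the trade-off: a deeper Moho is
    # compensated by a higher crustal Vs.
    print("correlation:", np.corrcoef(vs_crust, moho_depth)[0, 1])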

[72] Another trade-off that can be quantified is the correlation between the number of cells k and the magnitude of data noise σ. Figure 8 shows the joint posterior distribution of these two parameters. As expected, as the model complexity increases the data can be fit better, and the inferred level of data errors decreases. However, the degree of trade-off is limited, and the data clearly constrain the joint distribution of the two parameters reasonably well.

Figure 8. Joint posterior distribution for the number of layers k and the magnitude of data noise σ (note that k is a discrete variable whereas σ is continuous). White dashed lines show the true values used to construct the synthetic RF. As expected, a clear negative correlation can be seen between the two variables.

3.3. Joint Inversion of RF and SWD

[73] In this last section of synthetic experiments, we show how surface wave dispersion measurements with unknown errors can be added to the problem without having to pre-define the weight given to different data types in the inversion. The same synthetic model as in sections 3.1 and 3.2 is used to construct a synthetic RF and SWD (red lines in Figure 9), using the forward methods described in section 2.2. Synthetic noise randomly generated from (12) is added to produce the two observed data sets shown in blue in Figure 9. Since dispersion data are not time series but travel-time measurements at different periods, we consider the noise independent (i.e. uncorrelated), and use a diagonal covariance matrix to generate the SWD noise (i.e. σtrue,SWD = 0.1 and rtrue,SWD = 0). The two values used to generate random noise for the RF are σtrue,RF = 4 × 10^−2 and rtrue,RF = 0.85, with an exponential correlation law (the construction of such noise is sketched below).

Figure 9. Simulated data with and without added noise for (top) SWD and (bottom) RF. To illustrate the benefits of a joint inversion, the magnitude of the noise added to the receiver function here is larger than in sections 3.1 and 3.2.
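For reference, noise with the exponential-correlation covariance can be generated by multiplying a white-noise vector by a Cholesky factor of the covariance matrix. A minimal sketch using the values quoted above (vector lengths are arbitrary):

    import numpy as np

    def correlated_noise(n, sigma, r, rng):
        """Draw noise with covariance C_ij = sigma^2 * r^|i-j| (exponential law)."""
        lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
        cov = sigma**2 * np.power(float(r), lags)
        return np.linalg.cholesky(cov) @ rng.standard_normal(n)

    rng = np.random.default_rng(2)
    rf_noise = correlated_noise(500, sigma=4e-2, r=0.85, rng=rng)
    swd_noise = 0.1 * rng.standard_normal(60)   # diagonal covariance, i.e. r = 0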

[74] In the inversion we assume for simplicity that CeSWD in (12) is diagonal, and that CeRF follows the first type of parameterization (exponential correlation law); hence we invert for three hyper-parameters: σRF, rRF, and σSWD. In order to show the constraints brought by each of the two data sets, we first show results obtained after inverting the RF data alone (Figures 10a and 10b) and the SWD data alone (Figures 10c and 10d). The posterior distributions of Vs are wide in Figure 10a, and only the shallowest discontinuity is recovered in Figure 10b. This is because the RF is noisier than in sections 3.1 and 3.2, and hence the model is poorly recovered. The RF alone contains little absolute velocity information, which gives rise to a non-uniqueness problem known as the velocity-depth ambiguity (see Figure 7).

Figure 10. Separate and joint inversion results for the synthetic data in blue in Figure 9. (a and b) RF inversion. Receiver functions are sensitive to strong gradients in elastic properties (e.g. velocity discontinuities in the crust and upper mantle) but not to absolute velocity structure. (c and d) SWD inversion. In contrast to RFs, surface wave dispersion measurements are sensitive to absolute S-wave velocities but cannot constrain sharp gradients, and give poor results in locating interfaces. (e and f) Joint inversion. The information on seismic discontinuities is clearly revealed.

[75] As expected, when SWD data are inverted alone, average velocities are well recovered in Figure 10c but discontinuities are not constrained at all in Figure 10d. The joint inversion results in Figures 10e and 10f show a dramatic improvement. By adding dispersion curves to the RF data, the information on discontinuities is clearly revealed. This might seem counter-intuitive, since SWD data are poor at locating interfaces. However, as shown in Figure 7, RF data exhibit a trade-off between velocities and interface depths. Given this correlation, by constraining velocities the SWD data also indirectly constrain the depths of discontinuities. Thus, the two data types are complementary.

[76] For both separate and joint inversions, the algorithm is able to recover the level of noise added to each data set, and hence to fit each data type up to the required level. Posterior distributions on the number of layers and noise parameters are not shown, since such plots would be similar to those in Figure 5. A number of experiments were carried out with different numbers of data points and levels of noise in each data set. In all cases, the algorithm allows inference on data errors and both data types are adequately fitted.

4. Inversion of Field Measurements

[77] To demonstrate how the algorithm fares on real data, we apply it to RF and ambient noise SWD data from the WOMBAT experiment [Rawlinson and Kennett, 2008], an extensive program of temporary seismic array deployments throughout southeast Australia. Each array consists of between 30 and 60 short period instruments that record continuously for between five and ten months. Over the last decade, a total of over 500 sites have been occupied, resulting in a very large passive seismic data set that has been used for several studies [e.g., Graeber et al., 2002; Rawlinson et al., 2006; Rawlinson and Urvoy, 2006; Clifford et al., 2007; Rawlinson and Kennett, 2008; Arroucau et al., 2010; H. Tkalčić et al., Multi-step modeling of receiver-based seismic and ambient noise data from WOMBAT array: Crustal structure beneath southeast Australia, submitted to Geophysical Journal International, 2011]. Here we focus on SEAL3, one of the twelve temporary deployments occupying the eastern and central sub-province of the Lachlan Fold Belt in New South Wales (Tkalčić et al., submitted manuscript, 2011). We show results of our 1D transdimensional joint inversion beneath a particular station.

4.1. Constructing Receiver Functions

[78] For each station of the SEAL3 array, Tkalčić et al. (submitted manuscript, 2011) constructed an RF waveform from a selection of events with mb ≥ 5.5 and epicentral distances between 30° and 90° from the station, which ensured near-vertical incidence of P-waves. Furthermore, only events with back-azimuths between 0° and 90° (i.e. Tonga-Fiji) were used in this study. By choosing a narrow interval of ray parameters and a narrow azimuthal range, possible Moho dip and anisotropy are neglected; these are only second-order effects in the context of deriving 1D models of the crust and upper mantle that are compatible with multiple geophysical data sets.

[79] For each event, all three components were cut to the same time window and rotated to radial and tangential. Radial RFs were then calculated using the time domain iterative deconvolution procedure of Ligorria and Ammon [1999] with a low pass Gaussian filter with parameter a = 2.5, which determines the width of the filter and hence the frequency content of the RF. In order to select RFs that are mutually coherent and can be stacked to determine an observed RF at each station, the cross-correlation matrix approach described by Tkalčić et al. [2011] and Chen et al. [2010] was used. More details and figures describing the preprocessing of RFs are given by Tkalčić et al. [2011]. In this study, we invert for shear-wave structure beneath station S3A6 (32.900°S, 148.909°E), and hence the average RF across all selected events recorded at this station is used as the data vector in our algorithm (Figure 11). During the inversion, synthetic RFs predicted by proposed models are computed with a simple frequency domain deconvolution technique as detailed in section 2.2. Hence the observed and estimated data vectors in our inversion are computed with different deconvolution methods. There is a factor of two difference in amplitude between the two RF computations, and the observed RF has been normalized for comparison with the estimated data.

Figure 11. Observed data for station S3A6. (left) Fundamental mode Rayleigh wave group velocity dispersion measurements. The red curve shows the dispersion curve predicted by the model that best fits these data in the ensemble solution. (right) Receiver function calculated with the method of Tkalčić et al. [2011]. The red curve shows the RF predicted by the model that best fits the observed RF in the ensemble solution.

4.2. Ambient Noise Surface Wave Dispersion

[80] The SEAL3 deployment provided a large volume of high-quality continuous records over more than seven months. Arroucau et al. [2010] recently used this recorded noise, which comes from diffuse sources of seismicity such as oceanic or atmospheric disturbances, to produce SWD measurements. The cross-correlation of the ambient noise wavefield was computed on the vertical component of all simultaneously recording station pairs. The resulting time-averaged cross-correlograms exhibit a dispersed wavetrain, which can be interpreted as the Rayleigh wave component of the Green's function of the medium between the two stations [Shapiro and Campillo, 2004]. Rayleigh wave group traveltimes were then determined from the cross-correlograms (see Arroucau et al. [2010] for details).

[81] The goal of this study is to invert for a 1D shear-wave velocity model from simultaneous modeling of RF and SWD. However, RFs are associated with discrete points in geographical space, i.e. the locations of single stations, while SWD measurements are path-averaged data related to station pairs. In order to construct a dispersion curve associated with the location of a station, Tkalčić et al. (submitted manuscript, 2011) adopted the following strategy: station pairs located within a radius of 150 km around the station of interest were selected, and an average dispersion curve was calculated from all the group velocity measurements performed in that area. In order to ensure reliable group velocity estimates, a minimum inter-station distance of two wavelengths was required. Furthermore, the average velocities were only calculated if more than twenty observations were available for a given period. (Note that another way to produce point measurements for SWD is to carry out 2D tomographic inversions of travel-times at each period.) The average SWD curve thus obtained at station S3A6 is shown in Figure 11, and is inverted simultaneously with the observed RF at this station to infer a 1D shear-wave model beneath the recording site.

4.3. Results

[82] Posterior inference was made using an ensemble of about 5.2 × 10^5 models. The reversible jump algorithm was run on 288 parallel CPU cores sampling the model space simultaneously and independently. Each chain was run for 1.2 × 10^6 steps. The first 3 × 10^5 models were discarded as burn-in, after which the sampling algorithm was judged to have converged. Every 500th model visited thereafter was kept in the ensemble (a schematic of this bookkeeping is given below).
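The following schematic replaces the sampler itself with a stub; the counts follow the numbers above:

    n_chains, n_steps = 288, 1_200_000
    burn_in, thin = 300_000, 500

    def run_chain(n_steps):
        """Stub standing in for one reversible jump chain: one 'model' per step."""
        return range(n_steps)

    ensemble = []
    for _ in range(n_chains):
        models = run_chain(n_steps)
        ensemble.extend(models[burn_in::thin])   # discard burn-in, keep every 500th

    # 288 chains x 1800 retained samples per chain = 518,400, about 5.2e5 models
    print(len(ensemble))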

[83] The data noise covariance matrix Ce in (12) was parameterized with two hyper-parameters, σRF and σSWD, both treated as unknowns in the inversion, while the correlation parameters rRF and rSWD were kept fixed at predefined values. As above, dispersion data were considered independent measurements, and hence rSWD = 0, although we acknowledge that measurements close in frequency could have correlated errors. Since P-phase waveforms were filtered during the RF construction process with a Gaussian filter (with a = 2.5), we used the Gaussian correlation function in (10) to model RF uncertainties. Hence, the correlation parameter rRF was kept fixed at a value corresponding to white noise filtered with a Gaussian filter with parameter a. After some elementary calculus, this yields rRF = fs/a, where fs is the sampling frequency of the RF (O. Gudmundsson, personal communication, 2010).

[84] Note that using the exponential correlation law in (9) to model RF uncertainties would erroneously assume high frequency components in the noise which have obviously been cut by the Gaussian filter (see Appendix D). Alternatively, we could have constructed RFs using exponential filters.

[85] As in our synthetic experiments, independent uniform prior distributions were used for each model parameter with bounds set to 2–50 for the number of layers, 2.5–4.5 km/s for S-wave velocity values, 0–60 km for depth of Voronoi nuclei (see Figure 1), 0–0.12 1/s for σRF, and 0–0.4 km/s for σSWD.

[86] The results obtained are shown in Figure 12. Almost all the sampled S-wave velocity models are characterized by very low velocities in the first kilometer of the crust. These low values are interpreted as reflecting the presence at the surface of either unconsolidated sediments or weathered exposed rocks. The upper crust (0–15 km) shows complex structure and is characterized by the presence of velocity inversions. Indeed, Figure 12 (right) shows a large number of discontinuities in the first 15 km. The histogram of transition locations in the ensemble solution shows that the adaptive parameterization procedure automatically chose models with a number of thin shallow layers lying above thick deep layers. We suspect that this large number of inferred shallow discontinuities may indicate an inappropriate data noise parameterization. That is, by assuming the magnitude of data noise is constant along the time series, we give more weight to high amplitude signals (the 2 s and 5 s peaks in Figure 11), which are sensitive to shallow structure, and less weight to smaller-amplitude signals, which are sensitive to deeper structure.

Figure 12. Joint inversion of field data for station S3A6. (left) The ensemble solution, which contains 5.2 × 10^5 models, is fully represented with a color density plot, giving an estimate of the posterior probability distribution of Vs at each depth. (middle) The thick grey line shows the posterior mean model (or average solution), constructed by taking the average Vs at each depth across the ensemble solution. The thin black line shows the maximum of the marginal posterior (or maximum solution), which follows the maximum of the distribution of Vs with depth. (right) Posterior probability for the position of discontinuities, constructed from the histogram of transition depths in the ensemble solution.

[87] The S-wave velocity in the lower part of the crust (15–40 km) is relatively constant, with a mean Vs = 3.75 km/s, although the sampled models also show some small velocity steps at these depths. It is difficult to identify a relatively sharp crust-mantle transition. The Moho is, rather, characterized as a gradient transition zone over the depth range 40–45 km, which is consistent with the fact that S3A6 is located in a mountainous region such as the Lachlan Fold Belt. At depths below 40 km, the posterior distribution of shear wave velocity is clearly bimodal, with modes at Vs = 4 km/s and Vs = 4.5 km/s. In this case, a typical feature is that the maximum model "jumps" from one mode of the distribution to the other, resulting in a sharp discontinuity at 47 km. However, as can be seen in Figure 12 (right), this discontinuity is not required by the data, but rather results from the mode being an unstable measure of a bimodal distribution. Therefore, when interpreting results, the user needs to bear in mind that the information about shear wave velocity is represented by a full probability distribution, and that the maximum model and average model are only properties of this distribution. The inferred seismic models are consistent with the results of other studies (Tkalčić et al., submitted manuscript, 2011). Although our analysis of the geological implications is rather limited, the aim here is to show that the ensemble solution produced by the transdimensional approach is in good agreement with the main geological features beneath the receiver.

[88] In order to emphasize the importance of having two independent geophysical data sets, we compare in Figure 13 results obtained by inverting the two data sets separately and jointly. The joint solution in Figure 13 is thus identical to Figure 12 (left), but since the SWD data are measured at periods ranging from 1 to 28 s, they only constrain shallow structure, and hence we only show results down to 25 km depth.

Figure 13. (a–c) Posterior probability distribution of Vs at each depth, for individual and joint inversions of the data sets in Figure 11. (d) The average solution models for the three inversions carried out.

[89] As is well known, RFs are poor at constraining absolute velocities. The predictions made by the RF alone significantly underestimate the S-wave velocities, as revealed by the joint inversion (see the average solution models in Figure 13d). Figure 13b shows the constraints given by SWD alone. The interpretation in this case is limited to absolute velocity, without any indication of where the discontinuities in elastic properties are. When adding SWD to RF (Figure 13c), the velocity profiles tend to change in absolute velocity to accommodate the dispersion data, but without significant alteration of their overall shape as a function of depth, although we have seen in the synthetic experiments that even this is possible owing to the Vs/depth trade-off (see Figure 10).

[90] As in the synthetic examples of Figure 10, the inferred uncertainties on the 1D velocity model, i.e. the width of the marginal posterior at each depth, are reduced when adding SWD to the problem, especially in the depth range 5–15 km. This quantitatively shows the advantage of simultaneously inverting different data sets. However, note that at shallow depths (2–4 km) the joint inversion seems to increase model uncertainties. This might be due to inconsistency between the data sets at these depths, which is accounted for in the inversion as data noise, i.e. the inability of a given model to fit the data. Indeed, the volume of Earth sampled by RFs (sensitive to structure directly beneath the station) and by SWD data (sensitive to averages over large volumes under and around the station) is quite different.

[91] Since we are using a Gaussian correlation law to model CeRF, the parameter rRF needs to be fixed at the outset to a predefined value (see Appendix D). To validate this choice, we compare the residual waveform (observed − predicted) for the model that best fits the RF in the ensemble solution with a realization of random noise generated from the inferred CeRF. If the choice of rRF is adequate, and if σRF has been correctly inferred from the data, the noise realization and the residuals should have similar properties (i.e. variance and smoothness). This is because the data residuals can be considered a realization of the data errors, i.e. the data noise is defined as the component of the measurements that cannot be explained by g(m). From visual inspection, one can see in Figure 14 that the residuals and the estimated noise for the RF data are indeed similar. Although this is a qualitative test (a simple numerical version of this check is sketched below), posterior error validation can also be carried out by applying quantitative tests to the residuals resulting from one or an ensemble of models [see Dettmer et al., 2007, 2008, 2009, 2010].

Figure 14. Comparison between residuals and estimated noise for the RF data in the portion following the P-pulse, for the joint inversion. (top) Residual waveform (observed − predicted) for the best fitting model. (bottom) A random noise realization generated from the inferred hyper-parameter σRF and the fixed parameter rRF. From visual inspection, the two signals have the same variance and smoothness, which indicates that the magnitude of data noise has been estimated correctly and that a reasonable input value of r was chosen.
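The visual comparison can be supplemented with simple summary statistics: if σRF and the fixed rRF are adequate, the residuals and a realization drawn from the inferred covariance should share both their standard deviation and their smoothness (e.g. lag-one autocorrelation). A toy sketch, with smoothed white noise standing in for both series:

    import numpy as np

    def lag1_autocorr(x):
        x = x - x.mean()
        return (x[:-1] @ x[1:]) / (x @ x)

    rng = np.random.default_rng(3)
    kernel = np.ones(5) / np.sqrt(5.0)  # toy smoothing, roughly variance-preserving
    residual = np.convolve(rng.standard_normal(600), kernel, mode="same")  # stand-in
    noise    = np.convolve(rng.standard_normal(600), kernel, mode="same")  # stand-in

    # Similar values indicate the noise magnitude and correlation are consistent.
    print("std:  %.3f  %.3f" % (residual.std(), noise.std()))
    print("lag1: %.3f  %.3f" % (lag1_autocorr(residual), lag1_autocorr(noise)))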

5. Conclusion and Future Work

[92] Teleseismic receiver function analysis is now a well-established seismological technique, and a large number of schemes have been implemented over the last 30 years to infer seismic structure beneath broadband stations. In the last 15 years, receiver functions have been inverted together with surface wave dispersion curves, and a number of joint inversion algorithms have been proposed. However, a recurring problem is the definition of the misfit function one tries to minimize, and particularly the role of data noise in this function. Here we have presented a novel joint inversion method in which a Bayesian formulation is employed to produce a posterior probability distribution, so that each model parameter is described by a full probability density function. While the variance of the posterior can be used to assess uncertainty on model parameters, the posterior covariance directly quantifies the trade-offs (i.e. the correlations) between parameters. Hence the posterior can be examined from several points of view to infer different properties of the model (e.g. depth of transitions, mean Vs at a given depth, number of layers, etc.).

[93] The parameterization of the Earth is adaptive, and information is extracted from an ensemble of models with a variable number of layers. Beyond the transdimensional character of the inversion, an original feature is that little needs to be assumed about the data noise covariance matrix. This matrix plays a crucial role in a Bayesian problem, since it directly determines the form of the posterior. The covariance of data errors represents the magnitude and correlation of data noise, which in the case of RFs and SWD can be difficult to quantify. In a joint inversion, this matrix directly determines the level of information brought by each data set into the solution. Our philosophy is to let the data infer their own degree of uncertainty by treating the magnitude and correlation of noise as unknowns in the problem.

[94] There is relative freedom in the design of the 1D solution profiles needed for interpretation, e.g. the average solution model, maximum solution model, best fitting model, etc. Furthermore, instead of using the whole ensemble of collected models representing the posterior, one can first compute the expected value of the hyper-parameters (the expected number of cells or expected data noise) and construct a solution profile from only those models that take these hyper-parameter values, as sketched below. This is known in the statistical literature as Empirical Bayes. Alternatively, instead of using the expected values of hyper-parameters, one can take the mode of their marginal distributions.
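A minimal sketch of this Empirical Bayes style of selection, assuming only that the number of layers k has been recorded for every sampled model (the counts here are synthetic):

    import numpy as np

    rng = np.random.default_rng(4)
    k_samples = rng.poisson(12, 520_000)        # stand-in for k across the ensemble

    k_expected = int(round(k_samples.mean()))   # expected number of cells
    keep = np.flatnonzero(k_samples == k_expected)

    # A solution profile would then be averaged over the retained models only,
    # rather than over the full transdimensional ensemble.
    print(f"kept {keep.size} of {k_samples.size} models with k = {k_expected}")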

[95] A potential criticism of our methodology is the computational cost. If the Earth is defined by too many parameters, the number of models needed to sample the posterior distribution becomes large. And since the predicted data have to be computed each time a model is proposed, our algorithm may become computationally prohibitive. Here it is necessary to recognize that, even parallelized and optimized, our method is between one and three orders of magnitude slower than standard linearized inversions. Also, we have focused here on the mathematical problem and illustrated the algorithm in simple situations where a number of approximations have been made. We only inverted for S-wave velocity structure while holding the Vp/Vs ratio constant throughout the velocity model. An obvious improvement of the algorithm would be, at increased computational cost, to also treat the Vp/Vs ratio in each layer as an unknown. In addition, layers have been assumed homogeneous and horizontal, and it would be possible to treat anisotropy, dipping discontinuities, and lateral variations as unknowns in the problem. These improvements could be achieved by using densely spaced arrays, by including earthquake waveforms from a wide range of back-azimuths, and by using more sophisticated forward solvers.

[96] The approach presented in this paper is a general joint inversion strategy, and it has a wide range of possible applications in the geosciences (provided the model space is not too large). The method provides a solution model that can exhibit at the same time low gradients and sharp discontinuities. In this sense it appears to have considerable potential in the Earth sciences, given the dual continuous/discrete nature of the Earth. Geophysical inverse problems that seem appropriate include resistivity surveying with vertical electric sounding or electromagnetic (EM) surveys [Lowrie, 1997], inversion of frequency-domain airborne electromagnetic (AEM) data [e.g., Brodie and Sambridge, 2006, 2009], potential field studies [Luo, 2010], and seismic cross-hole tomography [Nicollin et al., 2008]. From a statistical inversion point of view, the main difference between geophysical inverse applications lies in the description of the forward problem. Since the method is based on a direct parameter search algorithm, any forward problem can be conveniently inverted. Hence the ideas presented here are relevant to other geophysical inverse problems where depth/time profiles, or indeed any 1D functions, are sought.

Acknowledgments

[126] This research was supported under the Australian Research Council Discovery projects funding scheme (project DP110102098). This project was also supported by a French-Australian Science and Technology travel grant (FR090051) under the International Science Linkages program from the Department of Innovation, Industry, Science and Research. Calculations were performed on the Terrawulf II cluster, a computational facility supported through AuScope. AuScope Ltd is funded under the National Collaborative Research Infrastructure Strategy (NCRIS), an Australian Commonwealth Government Programme. Computer software implementing the algorithms described in this paper is available from the authors.

    Appendix A: The Prior

    [97] Since we have independent parameters of different physical dimensions, in this work we only consider priors that are separable and hence can be written as a product of independent 1D priors on each variable. In some cases one might want to introduce joint priors on a subset of the variables, for example by making the prior variance of velocity in each layer dependent on the depth of the layer. In principle this could be done with additional calculations, but the algorithm would then be more difficult to implement and more computationally expensive. Here, the prior probability distribution is separated into three terms:
    $$p(\mathbf{m}) = p(\mathbf{v}, \mathbf{c} \mid k)\; p(k)\; p(\mathbf{h}) \tag{A1}$$
    where p(k) is the prior on the number of layers, and p(h) is the prior on noise hyper-parameters.
    [98] We choose for p(k) a uniform distribution over the interval I = {k ∈ ℕ | kmin < k ≤ kmax}. Hence,
    $$p(k) = \begin{cases} 1/\Delta k & k \in I \\ 0 & \text{otherwise} \end{cases} \tag{A2}$$
    where Δk = (kmax − kmin). Given a number of cells k, the prior probability distributions for the model parameters are independent of each other, and so can be written in separable form:
    $$p(\mathbf{v}, \mathbf{c} \mid k) = p(\mathbf{v} \mid k)\; p(\mathbf{c} \mid k) \tag{A3}$$
    Even though in the prior the parameterization variables c are independent of the velocity variables v, this will not be the case once the data are introduced, and hence we expect significant correlation in the posterior distribution as shown in Figure 7.
    [99] For velocity, the prior for the velocity vi in layer i is specified by a constant value over a defined interval J = {v ∈ ℝ | Vmin < v < Vmax}. Hence we have
    $$p(v_i \mid k) = \begin{cases} 1/\Delta v & v_i \in J \\ 0 & \text{otherwise} \end{cases} \tag{A4}$$
    where Δv = (Vmax − Vmin). Since the velocity in each layer is considered independent,
    $$p(\mathbf{v} \mid k) = \prod_{i=1}^{k} p(v_i \mid k) = \left(\frac{1}{\Delta v}\right)^{k} \tag{A5}$$
    As in the work of Bodin and Sambridge [2009], for mathematical convenience, let us for the moment assume that the Voronoi nuclei ci can only take positions on an underlying grid of N possible depths. For k Voronoi nuclei, there are $\binom{N}{k}$ possible configurations on this grid. We give equal probability to each of these configurations. Hence,
    $$p(\mathbf{c} \mid k) = \binom{N}{k}^{-1} = \frac{k!\,(N-k)!}{N!} \tag{A6}$$
    Given a set of hyper-parameters h, the prior for each hyper-parameter hj is specified by a uniform distribution over a defined interval Hj = {h ∈ ℝ | hjmin < h < hjmax}. Hence we have
    $$p(h_j) = \begin{cases} 1/\Delta h_j & h_j \in H_j \\ 0 & \text{otherwise} \end{cases} \tag{A7}$$
    where Δhj = (hjmax − hjmin). Since we consider each hyper-parameter independent in the prior,
    $$p(\mathbf{h}) = \prod_{j=1}^{m} \frac{1}{\Delta h_j} \tag{A8}$$
    where m is the number of hyper-parameters defining the data noise covariance matrix Ce. Therefore, after substituting (A2), (A5), (A6), and (A8) into (A1), the full prior probability density function can be expressed as
    $$p(\mathbf{m}) = \frac{k!\,(N-k)!}{N!} \left(\frac{1}{\Delta v}\right)^{k} \frac{1}{\Delta k} \prod_{j=1}^{m} \frac{1}{\Delta h_j} \tag{A9}$$
    given that all parameters in m fall within the ranges defined by their respective prior distributions. If at least one parameter falls outside the defined boundaries, the full prior becomes null.
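    In log form, and dropping the configuration term k!(N − k)!/N! (which cancels in the acceptance ratios of Appendix C), the prior (A9) reduces to a bounds check plus a dimension-dependent constant. A sketch with illustrative bounds and a single family of noise hyper-parameters:

        import numpy as np

        def log_prior(v, c, h, bounds):
            """Uniform transdimensional prior of (A9): constant inside the bounds,
            -inf outside. All names and bound values here are illustrative."""
            k = len(v)
            if not (bounds["k"][0] < k <= bounds["k"][1]):
                return -np.inf
            if np.any((v <= bounds["v"][0]) | (v >= bounds["v"][1])):
                return -np.inf
            if np.any((c < bounds["c"][0]) | (c > bounds["c"][1])):
                return -np.inf
            if np.any((h <= bounds["h"][0]) | (h >= bounds["h"][1])):
                return -np.inf
            log_dv = np.log(bounds["v"][1] - bounds["v"][0])
            log_dk = np.log(bounds["k"][1] - bounds["k"][0])
            log_dh = np.log(bounds["h"][1] - bounds["h"][0])
            return -k * log_dv - log_dk - len(h) * log_dh

        bounds = {"k": (2, 50), "v": (2.5, 4.5), "c": (0.0, 60.0), "h": (0.0, 0.4)}
        print(log_prior(np.array([3.0, 3.8]), np.array([10.0, 35.0]),
                        np.array([0.1]), bounds))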

    Appendix B: Proposal Distributions

    [100] Having randomly initialized the model parameters by drawing values from the prior distribution of each parameter, the algorithm proceeds iteratively. At each iteration of the chain, we propose a new model by drawing from a probability distribution q(m′|m), such that the new proposed model m′ is conditional only on the current model m. At each iteration of the reversible jump algorithm, one type of move is selected uniformly at random from the following five possibilities:

    [101] 1. Change the velocity in one layer. Randomly select a layer i from a uniform distribution over [1, k] and randomly propose a new value v′i using a Gaussian probability distribution centered at the current value vi:
    $$q(v_i' \mid v_i) = \frac{1}{\theta_1\sqrt{2\pi}} \exp\left\{ -\frac{(v_i'-v_i)^2}{2\theta_1^2} \right\} \tag{B1}$$
    The variance θ1² of the Gaussian function is a parameter to be chosen. Hence we have
    $$v_i' = v_i + u \tag{B2}$$
    where u is a random deviate from a normal distribution N(0,θ1). All the other model parameters are kept constant, and hence this proposal does not involve a change in dimension.
    [102] 2. Birth: create a new layer. Add a new Voronoi center at a position ck+1 chosen uniformly at random from the points of the underlying grid that are not already occupied; there are (N − k) such points available. Then a new velocity value vk+1 needs to be assigned to the new layer. This is drawn from a Gaussian proposal probability density of the same form as (B1):
    $$q(v_{k+1} \mid v_i) = \frac{1}{\theta_2\sqrt{2\pi}} \exp\left\{ -\frac{(v_{k+1}-v_i)^2}{2\theta_2^2} \right\} \tag{B3}$$
    where vi is the current velocity value at the depth ck+1 where the birth takes place. The variance θ2² of the Gaussian function is a parameter to be chosen.

    [103] 3. Death: Remove at random one layer by drawing a number from a uniform distribution over the range [1, k]. The response values of the neighboring cells remain unchanged.

    [104] 4. Move: Randomly pick one layer (from a uniform distribution) and randomly change the position of its nucleus according to
    $$c_i' = c_i + u, \qquad u \sim N(0,\theta_3) \tag{B4}$$
    [105] 5. Change the estimated data noise. Randomly select a hyper-parameter j from a uniform distribution over the range [1, m]. Propose a new value hj using
    $$h_j' = h_j + u, \qquad u \sim N(0,\theta_h) \tag{B5}$$

    [106] Note that the proposed model can fall outside the range defined by the uniform prior distribution. In this case, the prior (and hence the posterior) of the proposed model is null, and the model is rejected. In this way the proposal distributions are seen as true Gaussians rather than truncated versions.

    [107] The standard deviations (θ1, θ2, θ3, θh) of the Gaussian proposal functions are parameters to be fixed by the user. As shown by MacKay [2003], the magnitude of perturbations does not affect the solution, but rather the sampling efficiency of the algorithm. Thus the widths of the proposal distributions are tuned by trial and error in order to obtain an acceptance rate as close as possible to 44% for each type of perturbation [Rosenthal, 2000].
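    The five move types can be sketched compactly as follows; the proposal standard deviations are illustrative, not tuned, and the birth move uses the continuous limit of the position grid discussed at the end of Appendix C:

        import numpy as np

        rng = np.random.default_rng(5)

        def propose(v, c, h, theta=(0.1, 0.3, 2.0, 0.02), z_max=60.0):
            """One reversible jump proposal: one of the five move types at random.
            v: layer velocities, c: nuclei depths, h: noise hyper-parameters;
            the theta values are illustrative."""
            v, c, h = v.copy(), c.copy(), h.copy()
            move = rng.integers(5)
            if move == 0:                              # 1. perturb one velocity
                i = rng.integers(len(v))
                v[i] += theta[0] * rng.standard_normal()
            elif move == 1:                            # 2. birth of a layer
                c_new = rng.uniform(0.0, z_max)        # continuous limit N -> infinity
                i = np.abs(c - c_new).argmin()         # cell containing the birth point
                v = np.append(v, v[i] + theta[1] * rng.standard_normal())
                c = np.append(c, c_new)
            elif move == 2:                            # 3. death of a layer
                i = rng.integers(len(v))
                v, c = np.delete(v, i), np.delete(c, i)
            elif move == 3:                            # 4. move one nucleus
                i = rng.integers(len(c))
                c[i] += theta[2] * rng.standard_normal()
            else:                                      # 5. perturb one hyper-parameter
                j = rng.integers(len(h))
                h[j] += theta[3] * rng.standard_normal()
            return v, c, h

        # Proposals falling outside the uniform prior bounds are later rejected by
        # the acceptance test of Appendix C, so no truncation is applied here.
        v2, c2, h2 = propose(np.array([3.0, 3.8]), np.array([10.0, 35.0]), np.array([0.1]))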

    Appendix C: The Acceptance Probability

    [108] Once a proposed model has been drawn from the distribution q(m′|m), it is accepted with probability α(m′|m): a uniform random deviate r is generated between 0 and 1; if r ≤ α, the move is accepted, the current model m is replaced with m′, and the chain moves to the next step. If r > α, the move is rejected and the current model is retained for the next step of the chain, where the process is repeated. The acceptance probability α(m′|m) is the key to ensuring that the samples are generated according to the target density p(m|dobs). It can be shown [Green, 1995, 2003] that the chain of sampled models will converge to the transdimensional posterior distribution p(m|dobs) if
    $$\alpha(\mathbf{m}' \mid \mathbf{m}) = \min\left[1,\; \frac{p(\mathbf{m}')}{p(\mathbf{m})}\, \frac{p(\mathbf{d}_{\mathrm{obs}} \mid \mathbf{m}')}{p(\mathbf{d}_{\mathrm{obs}} \mid \mathbf{m})}\, \frac{q(\mathbf{m} \mid \mathbf{m}')}{q(\mathbf{m}' \mid \mathbf{m})}\, |\mathbf{J}| \right] \tag{C1}$$
    where the matrix J is the Jacobian of the transformation from m to m′, needed to account for the scale changes involved when the transformation involves a jump between dimensions [Green, 2003]. The expression for α(m′|m) involves the ratio of the posterior distribution evaluated at the proposed model m′ to that at the current model m, multiplied by the ratio of the proposal distribution for the reverse step, q(m|m′), to that for the forward step, q(m′|m).

    C1. Proposal Ratios

    [109] The proposal ratio of forward and reverse moves needs to be calculated so that the acceptance probability in (C1) can be evaluated in each case. For the proposal types that do not involve a change of dimension, the distributions are symmetric. That is, the probability of going from m to m′ is equal to the probability of going from m′ to m. Hence
    $$q(\mathbf{m}' \mid \mathbf{m}) = q(\mathbf{m} \mid \mathbf{m}') \tag{C2}$$
    and in all three cases the proposal ratio equals one.
    $$\frac{q(\mathbf{m} \mid \mathbf{m}')}{q(\mathbf{m}' \mid \mathbf{m})} = 1 \tag{C3}$$
    [110] For a birth step, the algorithm jumps from a model m with k layers to a model m′ with (k + 1) layers. Since the new nucleus ck+1 is generated independently of the new velocity value vk+1, the proposal distribution can be separated and we write
    $$q(\mathbf{m}' \mid \mathbf{m}) = q(c_{k+1} \mid \mathbf{c})\; q(v_{k+1} \mid \mathbf{v}) \tag{C4}$$
    Specifically, the probability of a birth at position ck+1 is given by
    $$q(c_{k+1} \mid \mathbf{c}) = \frac{1}{N-k} \tag{C5}$$
    the probability of generating the new velocity value vk+1 is given by
    $$q(v_{k+1} \mid \mathbf{v}) = \frac{1}{\theta_2\sqrt{2\pi}} \exp\left\{ -\frac{(v_{k+1}-v_i)^2}{2\theta_2^2} \right\} \tag{C6}$$
    the probability of deleting the cell at position ck+1 (reverse step)
    $$q(c_{k+1} \mid \mathbf{c}') = \frac{1}{k+1} \tag{C7}$$
    and the probability of removing a velocity when the cell is deleted (reverse step) is
    $$q(v_{k+1} \mid \mathbf{v}') = 1 \tag{C8}$$
    Substituting these expressions in (C4) we obtain
    $$\frac{q(\mathbf{m} \mid \mathbf{m}')}{q(\mathbf{m}' \mid \mathbf{m})} = \frac{N-k}{k+1}\; \theta_2\sqrt{2\pi}\; \exp\left\{ \frac{(v_{k+1}-v_i)^2}{2\theta_2^2} \right\} \tag{C9}$$
    [111] For the death of a randomly chosen nucleus, we move from k to (k−1) cells. Suppose that nucleus ci, with velocity vi, is removed. In this case, a similar reasoning to the birth case above leads us to a proposal ratio (reverse to forward) of
    $$\frac{q(\mathbf{m} \mid \mathbf{m}')}{q(\mathbf{m}' \mid \mathbf{m})} = \frac{k}{N-k+1}\; \frac{1}{\theta_2\sqrt{2\pi}}\; \exp\left\{ -\frac{(v_i-v_j)^2}{2\theta_2^2} \right\} \tag{C10}$$
    where vj is the velocity at depth ci in the new structure, c′, after removal of the ith layer.

    C2. The Jacobian

    [112] The Jacobian term “normalizes” the difference in volume between two spaces of different dimension. In our case, the Jacobian only needs to be calculated when there is a jump between two models of different dimensions, i.e. when a birth or death is proposed [Green, 1995]. If the current and proposed model have the same dimension, the Jacobian term is 1, and can be ignored.

    [113] For a birth step, the bijective transformation used to go from m to m′ can be written as
    $$(\mathbf{v}, \mathbf{c}, u_c, u_v) \longrightarrow (\mathbf{v}', \mathbf{c}') \tag{C11}$$
    The random variable uc used to propose a new nucleus ck+1 is drawn from a discrete distribution defined on the integers [0, 1, …, N − k]. The random number uv is drawn from the Gaussian distribution centered at 0 and the velocity assigned to the new cell is given by
    $$v_{k+1} = v_i + u_v \tag{C12}$$
    where vi is the current velocity value where the birth takes place.
    [114] Note that the model space is divided into a discrete space (nuclei position) and a continuous space (velocities). uc is a discrete variable used for the transformation between discrete spaces and uv is a continuous variable used for the transformation between continuous spaces. Denison et al. [2002] showed that the Jacobian term is always unity for discrete transformations. Therefore, the Jacobian term only accounts for the change in variables from
    $$(\mathbf{v}, u_v) \longrightarrow \mathbf{v}' \tag{C13}$$
    Hence, here the Jacobian term is the determinant of the matrix of all first-order partial derivatives of the vector v′ with respect to (v,uv), and we have
    $$|\mathbf{J}| = \left| \frac{\partial \mathbf{v}'}{\partial(\mathbf{v}, u_v)} \right| = 1 \tag{C14}$$

    [115] So it turns out that for this style of birth proposal the Jacobian is also unity. Since the Jacobian for a death move is |J|death = |J^−1|birth, it too is equal to one. Conveniently, the Jacobian is unity in every case and can be ignored.

    C3. The Acceptance Term

    [116] We now substitute the expressions for each proposal ratio into (C1) to obtain final expressions for the acceptance probability for each of the five possible moves described earlier. For the moves that do not include a change in dimension, we have seen that the proposal ratio is unity. Hence the acceptance term is simply given by the ratio of the posteriors
    $$\alpha(\mathbf{m}' \mid \mathbf{m}) = \min\left[1,\; \frac{p(\mathbf{m}' \mid \mathbf{d}_{\mathrm{obs}})}{p(\mathbf{m} \mid \mathbf{d}_{\mathrm{obs}})} \right] \tag{C15}$$
    Since the dimension of the model does not change, according to (A9) the prior ratio is either null or unity. If one of the proposed parameters falls outside the bounds defined by the prior, the prior ratio is null and α(m′|m) = 0. Otherwise,
    $$\alpha(\mathbf{m}' \mid \mathbf{m}) = \min\left[1,\; \frac{p(\mathbf{d}_{\mathrm{obs}} \mid \mathbf{m}')}{p(\mathbf{d}_{\mathrm{obs}} \mid \mathbf{m})} \right] \tag{C16}$$
    For changes in velocity values and nucleus positions, we have
    $$\alpha(\mathbf{m}' \mid \mathbf{m}) = \min\left[1,\; \exp\left\{ -\frac{\Phi(\mathbf{m}') - \Phi(\mathbf{m})}{2} \right\} \right] \tag{C17}$$
    When perturbing the data noise hyper-parameters h that define the data noise covariance matrix Ce, the normalizing constant in the likelihood changes and the ratio of determinants needs to be taken into account (see Appendix D):
    $$\alpha(\mathbf{m}' \mid \mathbf{m}) = \min\left[1,\; \left(\frac{|\mathbf{C}_e|}{|\mathbf{C}_e'|}\right)^{1/2} \exp\left\{ -\frac{\Phi(\mathbf{m}') - \Phi(\mathbf{m})}{2} \right\} \right] \tag{C18}$$
    Note that Φ(m′) and Φ(m) also incorporate C′e and Ce, respectively.
    [117] For a birth step, according to (A9), and provided the perturbed parameters fall within the bounds of the prior, the prior ratio takes the form
    $$\frac{p(\mathbf{m}')}{p(\mathbf{m})} = \frac{k+1}{(N-k)\,\Delta v} \tag{C19}$$
    After substituting (5), (C9), and (C19) into (C1), the acceptance term for the birth step reduces to
    $$\alpha(\mathbf{m}' \mid \mathbf{m}) = \min\left[1,\; \frac{\theta_2\sqrt{2\pi}}{\Delta v} \exp\left\{ \frac{(v_{k+1}-v_i)^2}{2\theta_2^2} - \frac{\Phi(\mathbf{m}') - \Phi(\mathbf{m})}{2} \right\} \right] \tag{C20}$$
    where i indicates the layer in the current tessellation c that contains the depth ck+1 where the birth takes place. Again, if the perturbed parameters fall outside the bounds of the prior, then α(m′|m) = 0 and the move is rejected. For the birth step, then, the acceptance probability is a balance between the proposal probability (which encourages velocities to change) and the difference in data misfit (which penalizes velocity changes so large that they degrade the fit to the data).
    [118] For the death step, the prior ratio in (C19) must be inverted. After substituting this, together with (5) and (C10), into (C1), and after simplification, we obtain the acceptance probability
    $$\alpha(\mathbf{m}' \mid \mathbf{m}) = \min\left[1,\; \frac{\Delta v}{\theta_2\sqrt{2\pi}} \exp\left\{ -\frac{(v_i-v_j)^2}{2\theta_2^2} - \frac{\Phi(\mathbf{m}') - \Phi(\mathbf{m})}{2} \right\} \right] \tag{C21}$$
    where i indicates the layer that we remove from the current tessellation c, and j indicates the cell in the proposed tessellation c′ that contains the deleted point ci. Unsurprisingly, the death acceptance probability has a similar form to that of the birth, with proposal and data terms opposing each other.

    [119] We see from these expressions that the variable N, i.e. the number of candidate positions for the nuclei, vanishes. This means that there is no need to use an actual discrete grid when generating nuclei positions. In fact, the grid was only ever a mathematical convenience that ensures the acceptance expressions have the correct analytic form. In practice we are at liberty to generate the nuclei using a continuous distribution over the region of the model (which is tantamount to N → ∞).
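    In log form, the birth and death acceptance terms (C20) and (C21) are each a proposal factor plus half the change in Mahalanobis misfit. A minimal sketch (all argument values are illustrative, and Φ is assumed to be computed elsewhere):

        import numpy as np

        def log_alpha_birth(phi_new, phi_old, v_new, v_i, theta2, dv):
            """log of the birth acceptance probability (C20)."""
            log_prop = np.log(theta2 * np.sqrt(2.0 * np.pi) / dv) \
                       + (v_new - v_i) ** 2 / (2.0 * theta2**2)
            return min(0.0, log_prop - 0.5 * (phi_new - phi_old))

        def log_alpha_death(phi_new, phi_old, v_i, v_j, theta2, dv):
            """log of the death acceptance probability (C21), the mirror of (C20)."""
            log_prop = np.log(dv / (theta2 * np.sqrt(2.0 * np.pi))) \
                       - (v_i - v_j) ** 2 / (2.0 * theta2**2)
            return min(0.0, log_prop - 0.5 * (phi_new - phi_old))

        # Accept the move if log(u) <= log_alpha, with u uniform on (0, 1).
        rng = np.random.default_rng(6)
        accept = np.log(rng.random()) <= log_alpha_birth(105.0, 100.0, 3.9, 3.7, 0.3, 2.0)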

    Appendix D: Parameterizing Ce

    D1. First Type of Noise Parameterization

    [120] The correlation function in (8) is assumed to decay exponentially and is thus given by c_i = r^i, where r = c_1 is a constant between 0 and 1 that describes the correlation between two adjacent samples in the time series. This correlation function is plotted in Figure D1a for different values of r, and realizations of noise with this correlation are shown in Figures D1c, D1e, and D1g. In this case, the data noise covariance matrix in (7) becomes:
    $$\mathbf{C}_e = \sigma^2 \begin{bmatrix} 1 & r & r^2 & \cdots & r^{n-1} \\ r & 1 & r & & \vdots \\ r^2 & r & 1 & & \\ \vdots & & & \ddots & r \\ r^{n-1} & \cdots & & r & 1 \end{bmatrix} \tag{D1}$$
    where n is the number of data points.
    Figure D1. (a and b) The two types of correlation function ci for different values of r. These symmetric functions represent the correlation between a point in the time series and its neighbors. In order to compare the two forms of noise assumed in this study, we plot different realizations of noise for different values of r and a fixed standard deviation σ = 1. (c, e, and g) Realizations with the first type of correlation (exponential decay), whereas in (d, f, and h) the noise vectors are generated with a Gaussian correlation. The second type of noise, in Figures D1b, D1d, D1f, and D1h, appears closer to what is observed on RFs before the first arrival; however, this way of parameterizing the noise turns out to be more difficult to implement in a Hierarchical Bayes inversion.

    [121] Hence, with this type of noise parameterization, the two hyper-parameters h = [σ, r] describing the noise covariance can be given wide uniform prior probability distributions, and posterior inference can be used to infer the magnitude and correlation of data noise. The two noise parameters are perturbed along the transdimensional Markov chain, and each time a new value is proposed, Ce^−1 and |Ce| are updated accordingly to compute the likelihood of the proposed model in (5).

    [122] It can be easily shown with linear algebra that the inverse of Ce is a symmetric tridiagonal matrix:
    $$\mathbf{C}_e^{-1} = \frac{1}{\sigma^2(1-r^2)} \begin{bmatrix} 1 & -r & & & \\ -r & 1+r^2 & -r & & \\ & \ddots & \ddots & \ddots & \\ & & -r & 1+r^2 & -r \\ & & & -r & 1 \end{bmatrix} \tag{D2}$$
    See Malinverno and Briggs [2004] for a detailed derivation (note that when r = 1 all the elements of Ce are equal to one and the inverse does not exist). The inverse data noise covariance matrix requires storage proportional to n, and computing the Mahalanobis distance in (4) only requires order n operations. The likelihood in (5) also needs the determinant of Ce. As shown by Malinverno and Briggs [2004], an expression for this determinant can be obtained by writing the tridiagonal inverse covariance matrix as Ce^−1 = LL^T, where L is a lower triangular matrix whose determinant is the product of its diagonal elements. The final result for the determinant of the data noise covariance matrix is
    $$|\mathbf{C}_e| = \sigma^{2n}\,(1-r^2)^{\,n-1} \tag{D3}$$

    [123] With this form of correlation, the data noise covariance matrix is well-conditioned and there are stable analytical expressions for Ce^−1 and |Ce|. However, as shown below, this is usually not the case for other forms of correlation function. The main advantage here is that each time we want to perturb r or σ along the random walk, we can update the determinant and inverse directly, without having to compute them numerically from the perturbed Ce, which would be too computationally expensive.
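    These closed forms are easy to verify numerically. The sketch below builds Ce^−1 from (D2) and log|Ce| from (D3), then checks them against dense linear algebra; the dense check is feasible only at small n, which is precisely why the analytical forms are used inside the Markov chain:

        import numpy as np

        n, sigma, r = 200, 0.05, 0.85
        idx = np.arange(n)

        # Analytical inverse (D2): symmetric tridiagonal matrix.
        inv_C = np.zeros((n, n))
        inv_C[idx, idx] = 1.0 + r**2
        inv_C[0, 0] = inv_C[-1, -1] = 1.0
        inv_C[idx[:-1], idx[:-1] + 1] = inv_C[idx[:-1] + 1, idx[:-1]] = -r
        inv_C /= sigma**2 * (1.0 - r**2)

        # Analytical log-determinant from (D3).
        log_det = 2 * n * np.log(sigma) + (n - 1) * np.log(1.0 - r**2)

        # Dense verification against C_ij = sigma^2 * r^|i-j|.
        C = sigma**2 * np.power(float(r), np.abs(np.subtract.outer(idx, idx)))
        assert np.allclose(inv_C, np.linalg.inv(C))
        assert np.isclose(log_det, np.linalg.slogdet(C)[1])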

    D2. Second Type of Noise Parameterization

    [124] The data noise covariance matrix can also be parameterized with a Gaussian correlation law c_i = r^{i²}. Here r can be written as r = e^{−1/(2ρ²)}, where ρ² is the variance of the Gaussian correlation function, which is shown in Figure D1b. Realizations of such noise for different values of r are shown in Figures D1d, D1f, and D1h. Note that a white noise that has been convolved with a Gaussian filter has exactly the same structure as this second type of noise.

    [125] Although this form of correlation clearly appears more relevant for our problem, it turns out that the associated data noise covariance matrix in (7) is highly ill-conditioned, and hence there is no stable analytical formulation for its inverse and determinant. Therefore Ce^−1 and |Ce| have to be computed numerically with an SVD decomposition and removal of a large number of small eigenvalues that destabilize the process. Unfortunately, an SVD decomposition of an n × n matrix is computationally expensive and cannot be carried out each time Ce is perturbed along the random walk. As a result, the correlation r needs to be fixed and cannot be treated as an unknown in the inversion. However, the magnitude of data noise σ² can be perturbed without having to re-invert Ce each time. This is because we have
    $$\mathbf{C}_e^{-1} = \frac{1}{\sigma^2}\,\mathbf{R}^{-1} \tag{D4}$$
    $$|\mathbf{C}_e| = \sigma^{2n}\,|\mathbf{R}| \tag{D5}$$
    where R is the correlation matrix defined by Ce = σ²R.
    In this way R^−1 is computed once at the beginning and remains fixed along the Markov chain. The variance of data noise σ² can be treated as an unknown, since each time a new value is proposed, Ce^−1 and |Ce| can be computed from (D4) and (D5) without having to redo any SVD decomposition.
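    A sketch of this scheme: the expensive SVD-based pseudo-inverse of the correlation matrix R is computed once, outside the chain, and only rescaled when a new σ is proposed. The fixed r value and the eigenvalue cutoffs are illustrative:

        import numpy as np

        n, r_fixed = 200, 0.99   # Gaussian law c_i = r^(i^2), with r fixed a priori
        lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float)
        R = r_fixed ** lags**2   # severely ill-conditioned correlation matrix

        # Done once: pseudo-inverse discarding small eigenvalues, and a
        # regularized log-determinant (cutoff values illustrative).
        R_pinv = np.linalg.pinv(R, rcond=1e-10)
        log_det_R = np.linalg.slogdet(R + 1e-10 * np.eye(n))[1]

        def noise_update(sigma):
            """Rescale for a newly proposed sigma via (D4)-(D5): no new SVD needed."""
            return R_pinv / sigma**2, 2 * n * np.log(sigma) + log_det_R

        inv_C, log_det_C = noise_update(0.05)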