Volume 56, Issue 1 e2018WR024240
Research Article

Incorporating Posterior-Informed Approximation Errors Into a Hierarchical Framework to Facilitate Out-of-the-Box MCMC Sampling for Geothermal Inverse Problems and Uncertainty Quantification

Oliver J. Maclaren

Corresponding Author

Department of Engineering Science, The University of Auckland, Auckland, New Zealand

Correspondence to: O. J. Maclaren and R. Nicholson,

[email protected];

[email protected]

Ruanui Nicholson

Corresponding Author

Department of Engineering Science, The University of Auckland, Auckland, New Zealand


Elvar K. Bjarkason


Department of Engineering Science, The University of Auckland, Auckland, New Zealand

John P. O'Sullivan


Department of Engineering Science, The University of Auckland, Auckland, New Zealand

Michael J. O'Sullivan


Department of Engineering Science, The University of Auckland, Auckland, New Zealand

First published: 03 January 2020


Abstract

We consider geothermal inverse problems and uncertainty quantification from a Bayesian perspective. Our main goal is to make standard, “out-of-the-box” Markov chain Monte Carlo (MCMC) sampling more feasible for complex simulation models by using suitable approximations. To do this, we first show how to pose both the inverse and prediction problems in a hierarchical Bayesian framework. We then show how to incorporate so-called posterior-informed model approximation error into this hierarchical framework, using a modified form of the Bayesian approximation error approach. This enables the use of a “coarse,” approximate model in place of a finer, more expensive model, while accounting for the additional uncertainty and potential bias that this can introduce. Our method requires only simple probability modeling and a relatively small number of fine model simulations, and it modifies only the target posterior: any standard MCMC sampling algorithm can be used to sample the new posterior. These corrections can also be used in methods that are not based on MCMC sampling. We show that our approach can achieve significant computational speedups on two geothermal test problems. We also demonstrate the dangers of naively using coarse, approximate models in place of finer models without accounting for the induced approximation errors. The naive approach tends to give overly confident and biased posteriors, while incorporating Bayesian approximation error into our hierarchical framework corrects for this while maintaining computational efficiency and ease of use.

Key Points

  • We consider geothermal inverse problems and uncertainty quantification from a Bayesian perspective
  • We present a simple method for incorporating posterior-informed approximation errors into a hierarchical Bayesian framework
  • Our method makes standard out-of-the-box MCMC sampling feasible for more complex models while correcting for bias and overconfidence

1 Introduction

Computational modeling plays an important role in geothermal reservoir engineering and resource management. A significant task for decision making and prediction in geothermal resource management is so-called inverse modeling, also known as model calibration within the geothermal community, and as solving inverse problems in applied mathematics. Calibration consists of determining parameters compatible with measured data. This is in contrast to so-called forward modeling in which a simulation is based on known model parameters. Comprehensive reviews of geothermal modeling, including both forward modeling and model calibration, are given by O'Sullivan et al. (2001) and O'Sullivan and O'Sullivan (2016).

The primary parameters of interest in geothermal inverse problems include the anisotropic permeability of the subsurface and the location and strength of so-called deep upflows/sources. Knowledge of the values of these parameters allows forecasts to be made of, for example, the temperature and pressure down drilled, or to-be-drilled, wells. The available (i.e., directly measurable) quantities, on the other hand, are typically temperature, pressure, and enthalpy at observation wells (O'Sullivan et al., 2001; O'Sullivan & O'Sullivan, 2016). A typical geothermal inverse problem for a natural-state, that is, steady state, preexploitation, model then consists of, for example, estimating formation permeabilities based on temperature and/or pressure measurements at observation wells.

The predominant method used to solve geothermal inverse problems is still manual calibration (Burnell et al., 2012; Mannington et al., 2004; O'Sullivan & O'Sullivan, 2016; O'Sullivan et al., 2009), although it is well recognized that this is far from an optimal strategy. To address this situation, there has been a concerted effort to automate the calibration process. For example, software packages such as iTOUGH2 (Finsterle, 2000) and PEST (Doherty, 2015) have been developed, and used, for geothermal model calibration. These packages are primarily based on framing the inverse problem as one of finding the minimum of a regularized cost, or objective, function; though essentially deterministic, approximate confidence (or credibility) intervals for model parameters can be constructed from local cost function derivative information (Aster et al., 2018). Even for optimization-based approaches to geothermal inverse problems, computations can be expensive, and improvements are required to speed up the process. We recently proposed accelerating optimization-based solution methods using adjoint methods and randomized linear algebra (Bjarkason, 2019; Bjarkason et al., 2018, 2019).

Bayesian inference is an alternative to optimization-based approaches, providing an inherently probabilistic framework for inverse problems (Kaipio & Somersalo, 2005; Tarantola, 2004; Stuart, 2010). This naturally allows for incorporation and quantification of uncertainty in the estimated parameters; when posed in the Bayesian setting, the solution to the inverse problem is an entire probability density over the parameters. Here we adopt a hierarchical Bayesian approach in particular, where we use “hierarchical Bayes” in the sense of Berliner (1996, 2003, 2012). This approach is discussed in detail in section 3. The key to the method proposed here is incorporating approximation errors between an accurate and a coarse model as a component in our hierarchical framework, by adapting the Bayesian approximation error (BAE) approach (Kaipio & Somersalo, 2005; Kaipio & Kolehmainen, 2013). This allows us to speed up computation of parameter estimates while avoiding overconfidence in biased estimates, by accounting for the approximation errors induced when coarsened models are used. The trade-off for improved computation time is modified posteriors with inflated variance relative to the ideal target posterior.

There is only a relatively small amount of literature taking a fully Bayesian approach to geothermal inverse problems (e.g., Cui et al., 2011; Cui, Fox, O'Sullivan, & Nicholls, 2019; Cui, Fox, & O'Sullivan, 2019; Maclaren et al., 2016), where by “fully Bayesian” we mean sampling (or otherwise computing) a full probability distribution rather than calculating a single point estimate and making local approximations to the posterior covariance matrix. We previously presented a hierarchical Bayesian approach to frame the inverse problem and used a generic sampling method to solve the resulting problem (Maclaren et al., 2016). On the other hand, Cui et al. (2011) and Cui, Fox, and O'Sullivan (2019) developed a more sophisticated adaptive sampling scheme based on using a coarsened model and a fine model. The present work extends the hierarchical Bayesian framework of Maclaren et al. (2016) to explicitly use approximate models while being independent of which sampling scheme is used and straightforward to implement.

2 Background: The Bayesian Approach to Inverse Problems

The Bayesian framework for inverse problems allows for systematic incorporation and subsequent quantification of parameter uncertainties (Kaipio & Somersalo, 2005; Stuart, 2010), which can then be propagated through to model predictions. In this framework, the solution to the inverse problem is an entire probability distribution, that is, the posterior probability distribution, or simply the posterior. Both epistemic (knowledge-based) and aleatoric (actually random) uncertainties are represented using the same probabilistic formalism in Bayesian inference.

Calculation of the posterior relies on Bayes' theorem, written here as

π(k | d) ∝ π(d | k) π(k), (1)

where k denotes the parameters of interest and d denotes measured data, such as downhole temperatures. Our parameters of interest are rock permeabilities; though we work with log permeabilities throughout, for simplicity we will generally refer to these simply as “permeabilities.” The above is written as a proportionality relationship, leaving out a normalization factor that is not required for most sampling algorithms (Gelman et al., 2013). In the above, π(d | k) is termed the likelihood and π(k) is the prior.
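Schematically, the unnormalized log-posterior above can be evaluated in a few lines of code. The forward model, noise level, and Gaussian prior below are illustrative stand-ins only, not the geothermal simulator or priors used in this paper:

```python
import numpy as np

def toy_forward(k, t):
    """Hypothetical forward model mapping (log-)parameters k to predicted data."""
    return np.exp(-np.exp(k[0]) * t) + k[1] * t  # stand-in for a reservoir simulator

def log_posterior(k, d, t, noise_std=0.05, prior_mean=0.0, prior_std=1.0):
    """Unnormalized log-posterior: log-likelihood plus log-prior (Bayes' theorem
    up to the normalization constant dropped in the proportionality above)."""
    residual = d - toy_forward(k, t)
    log_like = -0.5 * np.sum((residual / noise_std) ** 2)            # iid Gaussian likelihood
    log_prior = -0.5 * np.sum(((k - prior_mean) / prior_std) ** 2)   # Gaussian prior on k
    return log_like + log_prior
```

Any sampler that only needs the target density up to a constant (e.g., Metropolis-Hastings) can work directly with such a function.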

A drawback of the fully Bayesian approach is the intensive computational cost that is usually required to apply Bayes' theorem, especially in the case of complex models such as in the geothermal setting (see, e.g., Cui et al., 2015). The dominant cost is repeated evaluation of the forward model and thus coarsened or surrogate models are often used in place of the most accurate forward model (see, e.g., Asher et al., 2015). Furthermore, the use of coarsened or surrogate models can help alleviate numerical instabilities (Doherty & Christensen, 2011). However, replacement of an accurate model with a surrogate invariably results in so-called approximation errors, which, if not accounted for, can lead to parameters and their associated uncertainty being incorrectly estimated (see, e.g., Doherty & Welter, 2010; Kaipio & Somersalo, 2007; Kennedy & O'Hagan, 2000). Next we give a brief overview of the main approaches in the literature for accounting for these errors. We then discuss how we incorporate these ideas into a hierarchical framework.

2.1 Approximation Errors and Model Discrepancies

In the Bayesian viewpoint, approximation errors can be treated as a further source of uncertainty. There are two standard approaches for dealing with such errors: that based on the work of Kennedy and O'Hagan (2000) (referred to as KOH hereafter) and the BAE approach proposed by Kaipio and Somersalo (2005). The underlying principles of both approaches are similar, though with some implementation and philosophical differences. In particular, the KOH method was explicitly developed both to account for the difference between “reality” and a given simulation model and to allow for efficient emulation of computationally expensive models at arbitrary input values (Higdon et al., 2004, 2008; Kennedy & O'Hagan, 2000). The typical KOH method is based on infinite-dimensional Gaussian process models: one to model the difference between reality and the simulation model and one to represent the output of the simulation model at new input values. Usually, only one physically based model is used (Higdon et al., 2004, 2008).

The BAE approach, in contrast, is based on two physically based simulation models: one which represents the “best,” but typically very expensive model, and one representing a coarser model, which nevertheless preserves the key physics of the problem. Furthermore, the approximation errors between the two physically based models are represented by a finite-dimensional multivariate Gaussian distribution, defined only at the locations of interest. The statistics of the approximation errors are directly estimated empirically, based on a small number of simulations of both the accurate and coarse models, and structural constraints are not typically placed on the form of the covariance matrix (Kaipio & Kolehmainen, 2013). While differences between the fine model and coarse model in the BAE approach are generally considered “approximation” errors between two different models, rather than “discrepancies” between reality and a model as in the KOH approach, these approximation errors typically include significant correlation structure, and the approach has been shown to work well in physical experiments (see, e.g., Lipponen et al., 2011, 2008; Nissinen et al., 2010). Additional systematic error can also be incorporated in the BAE approach in a straightforward manner; that is, it can directly incorporate correlation structure for both the error between the fine model and the data and in the error between the fine model and the coarse model.

The BAE approach is particularly simple to implement and, given two physical models, requires less user input in terms of parameters and hyperparameters than the KOH approach. For a further discussion and comparison of the two methods see Fox et al. (2013). In this work we use (a variant of) the BAE approach; in contrast to past work in this area, however, we explicitly incorporate the approximation errors into a hierarchical framework. We discuss this next.

3 Hierarchical Framework

Here we outline our hierarchical Bayesian framework and where approximation errors enter. Implementation details are given in the following sections.

As described in Maclaren et al. (2016), the hierarchical Bayesian approach generally begins by assuming a three-stage decomposition of a full joint probability distribution over all quantities of interest, written schematically as

[data, process, parameters] = [data | process, observation parameters] × [process | process parameters] × [parameters]. (2)

These three stages correspond to a measurement model, a process model, and a parameter model, respectively. The process parameters and observation parameters include both parameters of interest, such as permeabilities, and parameters characterizing the probability distributions, such as covariance matrices. The above decomposition is not an identity of probability theory but instead contains plausible physical modeling assumptions about the conditional independencies separating measurement and process variables (Berliner, 1996, 2003, 2012). For example, the measurement model (first factor) is assumed to be independent of the process parameters, while the process model (second factor) is assumed to be independent of the observation parameters. In terms of our current problem variables this becomes

π(d, x, k) = π(d | x) π(x | k) π(k), (3)

where d is the observable (hence noisy) data vector, x is the latent or “true” process vector, and we have suppressed the distribution parameters and distribution subscripts in each stage for simplicity. The model approximation error enters into the above scheme as a probabilistic process error. Intuitively, we use a probabilistic model to capture the additional uncertainty introduced by using an approximate model in place of a more accurate model. This is despite the fact that both models are deterministic; we explain the nature of this approximation in the following sections. Note that in order to regain the standard form of Bayes' theorem, that is, equation 1, the process variable x can be marginalized (integrated) out to regain the likelihood, π(d | k). This procedure is described in detail below.

3.1 Representation Using Functional Relationships

An equivalent representation of the above factorization scheme can be given in terms of functional relationships between random variables. In particular, assuming additive error models, the measurement and process model components correspond to a two-stage decomposition of the form

d = x + e,   x = f_c(k) + ε, (4)

where e is the measurement error (which may include correlations) and f_c represents an approximate process model, that is, one that introduces additional errors ε but that, as alluded to previously, still preserves the key physics. Again, these errors may include significant correlations, though we will assume that the two vectors e and ε are independent of each other.

Combining the error models, and introducing the total error term ν = e + ε, allows us to write the above two-level model as a single-level model:

d = f_c(k) + ν, (5)

where the total error typically has nontrivial correlation structure due to approximation errors and possibly also due to measurement errors.

3.2 Likelihood

The above relationships define our likelihood π(d | k), with the probability model dependent on the probabilistic structure of the total error vector ν. This likelihood can be obtained conveniently by using the single-stage functional relationship in equation 5 involving the total error, along with formal marginalization over the total error:

π(d | k) = ∫ π_d(d | ν, k) π_ν(ν | k) dν = ∫ δ(d − f_c(k) − ν) π_ν(ν | k) dν = π_ν(d − f_c(k)).

The last step above follows when both error vectors are independent of the parameter, and δ is used to denote the Dirac delta distribution, which places all mass at 0. The assumption of independence of the model error vector and the parameter vector is discussed in detail in section 4. We have also explicitly denoted which probability distribution is being evaluated using subscripts.

These steps can be considered as a change of variables from ν to d via the delta method (Au & Tam, 1999; Khuri, 2004). As would be expected, this likelihood can also be obtained by marginalizing out the process variable in the factorization given in equation 3 and using the equivalent two-stage representation of the hierarchical model in equation 4, but the above derivation is slightly simpler.
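When the total error ν is modeled as Gaussian, the marginalized likelihood is simply a multivariate normal density evaluated at the residual d − f_c(k). A minimal sketch, assuming a generic coarse-model prediction and given total-error statistics (the helper names are ours, not the paper's):

```python
import numpy as np

def gaussian_logpdf(r, mean, cov):
    """Log-density of N(mean, cov) evaluated at r, via a Cholesky factorization."""
    L = np.linalg.cholesky(cov)          # cov = L @ L.T
    z = np.linalg.solve(L, r - mean)     # whitened residual
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (z @ z + log_det + len(r) * np.log(2.0 * np.pi))

def log_likelihood_bae(d, coarse_pred, nu_mean, nu_cov):
    """log pi(d | k) = log pi_nu(d - f_c(k)): the total error nu absorbs both
    measurement and approximation error (section 3.2)."""
    return gaussian_logpdf(d - coarse_pred, nu_mean, nu_cov)
```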

3.3 Error Components in the Hierarchical Framework

Here we consider the two key sources of error, measurement error and process (approximation) error, in more detail.

3.3.1 Measurement Error

Our measurement model is assumed to be independent of the approximation errors and takes the form

d = x + e,

where e is the measurement error, with π(e | ε) = π(e) due to our independence assumptions between the measurement and process error vectors.

In the two physically motivated cases considered in this paper we make the assumption that the measurement errors are also pairwise independent. However, this assumption is not required, and changing it simply changes the covariance matrix of the measurement errors (assuming they are Gaussian). A simple example using correlated measurement noise is provided in Appendix C.
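To illustrate the point that correlated noise only changes the covariance matrix, the sketch below builds an exponentially correlated measurement covariance; the correlation form and length scale are illustrative assumptions, not the model used in the paper:

```python
import numpy as np

def correlated_noise_cov(positions, std, corr_length):
    """Covariance for Gaussian measurement noise with exponential correlation
    between observation locations; short corr_length approaches the iid case."""
    dist = np.abs(positions[:, None] - positions[None, :])  # pairwise distances
    return (std ** 2) * np.exp(-dist / corr_length)
```

The resulting matrix is simply passed to the Gaussian likelihood in place of a diagonal covariance.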

3.3.2 Process Error

Process errors, that is, approximation errors, are introduced by using a coarse model in place of a finer, or more accurate, simulation model. The fine model is represented by a function f(k_f) and the coarse model by a function f_c(k_c), where k_f and k_c are the (vectors of) fine- and coarse-scale parameters of interest. In our case the fine-scale and coarse-scale parameters have the same dimension, despite corresponding to different discretization grids. That is, both models share the same parameters, and thus we will drop the explicit distinction between k_f and k_c in what follows and simply refer to both by k (but see Appendix A for a discussion of the relationship between fine-scale and coarse-scale parameter grids).

To model the process approximation error, we assume the true or latent process variable to be generated exactly by the fine-scale model; that is,

x = f(k).

In Maclaren et al. (2016) we only used one model and essentially had x = f(k) as our process model. Here we explicitly introduce both fine and coarse models and take into account that f(k) ≠ f_c(k). To do this, we define the process model error variable by

ε = f(k) − f_c(k).

Since both f(k) and f_c(k) are deterministic for a given k, at this point ε must be, too; this can be formally incorporated into the hierarchical model by again treating deterministic functions as delta distributions. Thus, we can write

π(ε | k) = δ(ε − [f(k) − f_c(k)]),

which simply amounts to a deterministic change of variables (carried out, e.g., via the delta method; Au & Tam, 1999; Khuri, 2004).

3.3.3 Total Error

Our goal here is to compute the posterior for the parameters given the data,

π(k | d) ∝ π(d | k) π(k),

where the total error has been marginalized over and where π(d | k) is the likelihood and π(k) is the prior. From the above, we see the posterior can be written as

π(k | d) ∝ π_ν(d − f_c(k)) π(k),

where the process error has now been absorbed into the likelihood. The resulting expression is hence simply a standard “measurement” likelihood function, written in terms of the total error, multiplied by the prior. It does, however, require the distribution of the total error, π_ν, to be known.

To construct a model of the total error, we (a) assume that the measurement error e is Gaussian and (b) approximate the process model error ε as Gaussian. Both of these random vectors may in general exhibit significant correlations between their respective components, and this is accounted for in the present approach, but the two vectors are assumed independent of each other. This makes combining the two errors straightforward (as described in the next subsection). Ultimately, we determine whether these, and the other approximations used thus far, are reasonable based on whether they work in practice, for example, whether they recover good estimates of the true parameters in test cases and whether any available error distributions “look normal” when plotted (or, if desired, pass formal tests of normality).
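Under the independence and Gaussianity assumptions above, combining the two errors reduces to adding their means and covariances; a one-function sketch:

```python
import numpy as np

def total_error_stats(e_mean, e_cov, eps_mean, eps_cov):
    """Mean and covariance of nu = e + eps for independent Gaussian
    measurement error e and approximation error eps."""
    return e_mean + eps_mean, e_cov + eps_cov
```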

4 Computation of Error Models: Standard, Composite, and Posterior Informed

Both the probabilistic process model π(x | k) and the process error model π(ε | k) are typically intractable to simulate from for more than a limited number of realizations, as both involve the expensive fine-scale model. This motivates using approximations to these distributions, and results in approximate posterior distributions relative to the ideal target. The goal of these approximations is not to accurately estimate the model error as such but to approximately model the effect of marginalizing over it. This is for the purpose of reducing the bias/overconfidence in parameter estimates that would result from just using the simpler model directly; some loss of precision/statistical efficiency is expected. Here we give an overview of how the standard, composite, and posterior-informed approximation error models are computed and discuss relevant related literature. Explicit algorithms are given in the following section.

4.1 Premarginalization

Due to the computational issues discussed above, in the standard BAE approach the statistics of the approximation errors are precomputed empirically via directly drawing samples from the prior distribution, without the use of Markov chain Monte Carlo (MCMC). Similarly, here we compute the statistics of the approximation errors via direct sampling, though from a (naive) posterior distribution rather than the prior distribution, which itself can be (and, here, was) computed by separate MCMC sampling. MCMC sampling methods are discussed in section 5.3.

Our approach has the advantage of allowing a set budget of fine model runs to be specified, as well as requiring minimal implementation effort. In contrast, some recent MCMC sampling schemes explicitly estimate and incorporate approximation errors during the MCMC sampling process. Similarly to our proposed method, Cui et al. (2011) and Cui, Fox, and O'Sullivan (2019) consider carrying out MCMC sampling on models of distinct levels of discretization while accounting for the approximation error; however, they use an adaptive delayed acceptance BAE approach to build the approximation error model during the MCMC sampling. In the methods developed by Cui et al. (2011) and Cui, Fox, and O'Sullivan (2019) the accurate model is typically run for each MCMC sample accepted based on the coarse model. This can make it more difficult to control the number of fine model runs used. While it is possible in principle to further modify the MCMC scheme used to incorporate such constraints, our approach offers a simple and direct way of controlling the number of fine model runs used.

Xu et al. (2017), Zhang et al. (2018), and Lødøen and Tjelmeland (2010) apply the KOH method to account for the approximation errors, and these are incorporated into an adaptive multifidelity MCMC sampler (see, e.g., Peherstorfer et al., 2018), the Differential Evolution Adaptive Metropolis sampler (Vrugt et al., 2009; Laloy & Vrugt, 2012), and the Metropolis-Hastings algorithm (see, e.g., Chib & Greenberg, 1995), respectively. Again, these require more sophisticated understanding and control of the MCMC scheme used and involve infinite-dimensional stochastic processes following the approach of KOH. Here we provide a simple alternative based on the BAE approach to approximation error and involving finite-dimensional probability distributions only.

4.2 Standard Approximation Error

The standard BAE approach (Kaipio & Somersalo, 2005; Kaipio & Kolehmainen, 2013) is to first simulate a limited number of realizations from the true (i.e., involving the fine-scale model) joint distribution

π(ε, k) = π(ε | k) π(k),

using a given parameter prior π(k), and then fit an approximate distribution π̃(ε, k) to the (ε, k) realizations. This empirically estimated approximate distribution is then used as a plug-in replacement,

π(ε, k) → π̃(ε, k),

in the hierarchical model. While the true points will lie on a surface of zero thickness, π̃(ε, k) is estimated within a nondegenerate family of probability distributions, such as a multidimensional normal distribution. This procedure aims to “conservatively” cover the sample points, despite the obvious model misspecification (see Figure B1).
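The premarginalization step can be sketched as follows. The toy discretized simulator and Gaussian prior sampler are hypothetical stand-ins for the paper's fine and coarse geothermal models; only the grid resolution differs between the two:

```python
import numpy as np

def discretized_model(k, n_grid):
    """Toy simulator: the observable is a running integral of a k-dependent profile,
    approximated on an n_grid-point grid (a coarser grid induces approximation error)."""
    t = np.linspace(0.0, 1.0, n_grid)
    u = np.exp(-np.exp(k[0]) * t) * np.cos(3.0 * k[1] * t)
    cum = np.concatenate([[0.0], np.cumsum(0.5 * (u[1:] + u[:-1]) * np.diff(t))])
    return np.interp([0.25, 0.5, 1.0], t, cum)  # three "observation well" locations

fine_model = lambda k: discretized_model(k, 401)   # stand-in for the expensive fine model
coarse_model = lambda k: discretized_model(k, 5)   # stand-in for the cheap coarse model

def estimate_eps_stats(prior_sampler, n_samples=200, rng=None):
    """Standard BAE premarginalization: draw k from the prior, run both models,
    and fit a Gaussian to the sampled approximation errors eps = f(k) - f_c(k)."""
    rng = np.random.default_rng(0) if rng is None else rng
    eps = np.array([fine_model(k) - coarse_model(k)
                    for k in (prior_sampler(rng) for _ in range(n_samples))])
    return eps.mean(axis=0), np.cov(eps, rowvar=False)
```

The returned mean and covariance are then combined with the measurement error statistics to form the total-error model.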

4.3 Enhanced, or Composite, Approximation Error

A further approximation is often used, which leads to what is called the “enhanced error model” in the BAE literature (Kaipio & Somersalo, 2005; Kaipio & Kolehmainen, 2013), and which we will follow in the present work. This amounts to replacing the true joint distribution by the product of the empirically estimated, but true, marginal distributions:

π(ε, k) ≈ π(ε) π(k),

where π(ε) is estimated empirically based on samples as described in section 5. In the above, for the purpose of estimating π(ε), π(ε | k) is taken as the true conditional error distribution, and hence the samples are used to estimate the true marginal. On the other hand, in all subsequent calculations the joint distribution is approximated by the product of the marginals. This is equivalent to using the marginal error distribution for ε as a plug-in empirical estimator of the conditional error distribution for ε in the hierarchical model, prior to subsequent inference steps. Importantly, this does not mean that the individual errors in the vector ε are independent of each other, rather that the vector random variable ε is independent of the vector random variable k. The estimated errors ε almost always exhibit significant correlations between components, and these are accounted for here.

As emphasized above, the goal is not to get the error exact, but to account for it in a somewhat “conservative” manner. While in the BAE literature this is referred to as the enhanced error model, the replacement of an intractable conditional distribution in a product of distributions by a more accessible marginal distribution is also similar in philosophy to that used in, for example, the composite likelihood literature (Varin, 2008; Varin et al., 2011). Hence, we will prefer to refer to it as the composite error model in the remainder of the text.

Finally, we note that after both the true marginal process model error has been empirically estimated, and the plug-in replacement has been made for the conditional distribution, the full process model error vector, ε, is assumed to be (formally) conditionally independent of the full parameter vector, k, in any subsequent manipulations of the probability distributions.

4.4 Posterior-Informed Composite Approximation Error

Another practical issue with both of the above approximation procedures (i.e., both the full and the composite error models) arises in complicated models such as those in geothermal reservoir modeling (see, e.g., O'Sullivan & O'Sullivan, 2016): model run failures, long model run times, and/or extreme model outputs when sampling from an insufficiently informative prior and running the fine-scale model (in particular). We encountered a large number of such model run issues for the fine-scale model and were thus motivated to consider a further approximation to the process model error. This can be described as a posterior plug-in estimate of the model approximation error. In particular, we make the plug-in estimate

π(ε) ≈ π̃(ε),

where we now use the coarse model posterior for the parameters to estimate the error distribution marginalized over the parameter. That is, we use

π̃(ε) = ∫ π(ε | k) π̃(k | d) dk,

which is estimated empirically based on samples as described in section 5 and where

π̃(k | d) ∝ π_c(d | k) π(k),

and π_c(d | k) is the likelihood function based on the coarse-scale model f_c. Since we did not encounter model run issues in the coarse model, we can estimate this by combining the likelihood with the broad prior.

Once the error distribution has been estimated, we again use the composite model of the joint distribution, along with the original prior:

π(ε, k) ≈ π̃(ε) π(k).

Thus, we are simply using a different plug-in estimate of the model error. Since this estimate is now computed under the coarse model, we can revert to the broad prior, without model run failures, in all subsequent calculations.

Again, because our goal is not to model the error exactly, but rather to model the effect of marginalizing over it, we are willing to tolerate more potential inaccuracy at this stage. The present step of using posterior sampling for the approximation error is "riskier" than that in the previous section, however, in the sense that it involves a formal "double use of data" and tends to narrow, rather than widen, the error distribution compared to that obtained using the prior. A geometric interpretation of this posterior model approximation step, and its potential dangers, is given in Appendix B.

Despite the above warnings, we believe that the use of posterior approximation errors, as described in the present work, is often a practical solution for complex models. It also has the benefit of providing more "relevant" estimates of the model error when the posterior based on the coarse model is not too far from the true posterior. One way to check this assumption would be to recompute the model error distribution under the final posterior and compare it to the error distribution computed under the coarse model posterior; checking the similarity of these distributions can be thought of as a form of posterior predictive check (see, e.g., Gelman et al., 2013, for a good general discussion of posterior predictive checks). This check does, however, require recomputing realizations from the fine-scale model and so is not always practical.

5 Statistical Algorithms

By taking a Gaussian approximation of the process error, we can characterize its distribution with the mean and covariance only. As discussed above, these cannot be computed analytically in general and thus must be estimated empirically via samples. In this section we give algorithmic details for both the standard composite error model approach and our proposed posterior-informed composite error model approach. Pseudocode is provided for both of the methods. We also outline the MCMC method used for sampling the resulting target posterior.

5.1 The Standard Composite Error Model Approach

To calculate the statistics of the process error, $\epsilon$, in the standard composite error model approach, an ensemble of $N$ samples is drawn from the prior distribution $\pi(x)$, say, $x^{(i)}$, for $i = 1, \dots, N$. Both the fine and coarse models are then run for these samples, resulting in an ensemble of approximation errors:

$$\epsilon^{(i)} = f\left(x^{(i)}\right) - g\left(x^{(i)}\right), \quad i = 1, \dots, N,$$

where $f$ and $g$ denote the fine and coarse models, respectively. The ensemble mean and covariance of the approximation errors are then estimated:

$$\epsilon^* = \frac{1}{N} \sum_{i=1}^{N} \epsilon^{(i)}, \qquad \Gamma_\epsilon = \frac{1}{N-1} \sum_{i=1}^{N} \left(\epsilon^{(i)} - \epsilon^*\right)\left(\epsilon^{(i)} - \epsilon^*\right)^{T}.$$

As discussed above, the total error, $\nu = e + \epsilon$, is the sum of the noise and the process model error, and thus the distribution of the total error (given the normality assumption) is

$$\nu \sim N\left(\epsilon^*, \Gamma_e + \Gamma_\epsilon\right),$$

where $\Gamma_e$ is the measurement noise covariance.

This new distribution is then used to update the likelihood, which consequently updates the posterior density.

Algorithm 1 gives pseudocode for the standard composite error model approach for constructing the distribution of the total errors and for carrying out the inversion.
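The steps above can be sketched in a few lines of Python. The linear "fine" and "coarse" models below are toy stand-ins for the reservoir simulators, and all names and values are illustrative only, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the simulators: a "fine" model f and a "coarse" model g
# mapping p parameters to d observable outputs (illustrative only).
d, p, N = 8, 3, 500
A = rng.standard_normal((d, p))

def f_fine(x):
    return A @ x + 0.1 * np.sin(A @ x)   # "fine" model (mildly nonlinear)

def g_coarse(x):
    return A @ x                          # "coarse" approximation

# 1. Draw an ensemble of N samples from the prior pi(x).
X = rng.standard_normal((N, p))

# 2. Run both models; form approximation errors eps_i = f(x_i) - g(x_i).
eps = np.array([f_fine(x) - g_coarse(x) for x in X])

# 3. Ensemble mean and covariance of the approximation errors.
eps_star = eps.mean(axis=0)
Gamma_eps = np.cov(eps, rowvar=False)

# 4. Total-error covariance: measurement noise plus approximation error.
sigma_noise = 0.05
Gamma_nu = sigma_noise**2 * np.eye(d) + Gamma_eps

# Corrected Gaussian log-likelihood, evaluated with the coarse model only.
L = np.linalg.cholesky(Gamma_nu)

def log_likelihood(x, y):
    r = np.linalg.solve(L, y - g_coarse(x) - eps_star)
    return -0.5 * r @ r - np.log(np.diag(L)).sum() - 0.5 * d * np.log(2 * np.pi)
```

The corrected `log_likelihood` can then be handed to any standard MCMC sampler in place of the noise-only likelihood.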

5.2 The Proposed Posterior-Informed Composite Error Model Approach

In the approach proposed here, we avoid sampling from the prior density of $x$ to generate the ensemble $\{x^{(i)}\}$, so as to avoid model failures and extreme run times. Instead, we initially construct a naive posterior density for $x$, denoted $\hat{\pi}(x \mid y)$, using MCMC with the likelihood function induced by the noise term, $e$, only, and using the coarse model, $g$. This results in samples from the naive posterior, $\hat{\pi}(x \mid y)$, which are then passed through the two models to construct the process model errors, $\epsilon^{(i)} = f(x^{(i)}) - g(x^{(i)})$. Once these samples of $\epsilon$ have been generated, the method proceeds essentially as in the standard composite error model approach.

Pseudocode for the proposed posterior-informed composite error model approach is given in Algorithm 2.
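A sketch of this posterior-informed variant, under the same kind of toy setup as before (the models and all values are illustrative stand-ins; a simple random-walk Metropolis sampler is used for the naive posterior, though any standard sampler could be substituted):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins (illustrative): fine model f, coarse model g.
d, p = 8, 3
A = rng.standard_normal((d, p))
f_fine = lambda x: A @ x + 0.1 * np.sin(A @ x)
g_coarse = lambda x: A @ x

sigma = 0.05
x_true = rng.standard_normal(p)
y = f_fine(x_true) + sigma * rng.standard_normal(d)

# Step 1: naive MCMC on the coarse model with the noise-only likelihood.
def log_naive_post(x):
    r = y - g_coarse(x)
    return -0.5 * r @ r / sigma**2 - 0.5 * x @ x   # N(0, I) prior

x, lp = np.zeros(p), log_naive_post(np.zeros(p))
chain = []
for _ in range(4000):
    xp = x + 0.1 * rng.standard_normal(p)
    lpp = log_naive_post(xp)
    if np.log(rng.uniform()) < lpp - lp:   # Metropolis accept/reject
        x, lp = xp, lpp
    chain.append(x)
chain = np.array(chain)

# Step 2: pass a subsample of naive-posterior draws (second half of the
# chain) through both models to form the approximation errors.
idx = rng.choice(len(chain) // 2, size=300, replace=False) + len(chain) // 2
eps = np.array([f_fine(xi) - g_coarse(xi) for xi in chain[idx]])
eps_star, Gamma_eps = eps.mean(axis=0), np.cov(eps, rowvar=False)

# Step 3: total-error statistics, exactly as in the standard approach.
Gamma_nu = sigma**2 * np.eye(d) + Gamma_eps
```

The only difference from the standard approach is where the ensemble used for the error statistics comes from: the naive coarse-model posterior rather than the prior.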

5.3 MCMC Sampling

In the present work, MCMC sampling is carried out using the Python package emcee (Foreman-Mackey et al., 2013). This package implements an affine invariant ensemble sampler (Goodman & Weare, 2010) and has the benefit of being easy to apply to arbitrary user-defined models. It also allows for easy communication with the PyTOUGH Python interface (Croucher, 2011) to TOUGH2 (Pruess et al., 1999) and AUTOUGH2 (Yeh et al., 2012), The University of Auckland's version of TOUGH2, for carrying out the forward simulations.

For large-dimensional problems the affine invariant ensemble sampler may be inadequate (Huijser et al., 2015), in which case alternative out-of-the-box samplers, such as those available in Stan (Carpenter et al., 2017) or PyMC (Patil et al., 2010), could be used. However, as alluded to earlier, the approach outlined here is essentially independent of the particular MCMC sampler, providing flexibility in the choice of sampling scheme while also being compatible with nonsampling, optimization-based methods.
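For reference, the core "stretch move" of the affine invariant ensemble sampler that emcee implements (Goodman & Weare, 2010) can be sketched as follows; this is an illustrative reimplementation for exposition, not the package's own code:

```python
import numpy as np

def stretch_move_sampler(log_prob, p0, nsteps, a=2.0, rng=None):
    """Minimal serial Goodman & Weare (2010) stretch-move ensemble sampler.

    p0: (nwalkers, ndim) initial ensemble; returns (nsteps, nwalkers, ndim).
    """
    rng = rng or np.random.default_rng()
    walkers = np.array(p0, dtype=float)
    nwalkers, ndim = walkers.shape
    logp = np.array([log_prob(w) for w in walkers])
    chain = np.empty((nsteps, nwalkers, ndim))
    for t in range(nsteps):
        for k in range(nwalkers):
            # Pick a complementary walker and propose along the line to it.
            j = rng.integers(nwalkers - 1)
            j = j if j < k else j + 1
            # Stretch factor z ~ g(z) ∝ 1/sqrt(z) on [1/a, a], via inverse CDF.
            z = (1 + (a - 1) * rng.uniform()) ** 2 / a
            prop = walkers[j] + z * (walkers[k] - walkers[j])
            lp = log_prob(prop)
            # Acceptance includes z^(ndim-1) from the move's Jacobian.
            if np.log(rng.uniform()) < (ndim - 1) * np.log(z) + lp - logp[k]:
                walkers[k], logp[k] = prop, lp
        chain[t] = walkers
    return chain

# Usage on a standard normal target (illustrative):
chain = stretch_move_sampler(lambda x: -0.5 * x @ x,
                             np.random.default_rng(0).normal(size=(20, 2)),
                             nsteps=200, rng=np.random.default_rng(1))
```

The only user-supplied ingredient is the log-posterior function, which is why swapping the naive likelihood for the approximation-error-corrected one requires no change to the sampler itself.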

6 Computational Studies

We consider multiphase nonisothermal flow in a geothermal reservoir, including both two-dimensional and three-dimensional reservoir case studies.

6.1 Governing Equations for Geothermal Simulations

Our general problem is governed by the mass balance and energy balance equations:

$$\frac{\mathrm{d}}{\mathrm{d}t} \int_{V} M^{m} \,\mathrm{d}V = -\int_{\partial V} \mathbf{F}^{m} \cdot \hat{\mathbf{n}} \,\mathrm{d}\Gamma + \int_{V} q^{m} \,\mathrm{d}V, \tag{25}$$

$$\frac{\mathrm{d}}{\mathrm{d}t} \int_{V} M^{e} \,\mathrm{d}V = -\int_{\partial V} \mathbf{F}^{e} \cdot \hat{\mathbf{n}} \,\mathrm{d}\Gamma + \int_{V} q^{e} \,\mathrm{d}V, \tag{26}$$

respectively (Pruess et al., 1999). Here $V$ is the control volume with boundary $\partial V$, $\hat{\mathbf{n}}$ denotes an outward-pointing unit normal vector to $\partial V$, $M^{m}$ and $M^{e}$ represent the amount of mass per unit volume (kg/m³) and the amount of energy per unit volume (J/m³), respectively, $\mathbf{F}^{m}$ and $\mathbf{F}^{e}$ are the mass flux (kg/(m²s)) and energy flux (J/(m²s)), respectively, while $q^{m}$ and $q^{e}$ represent mass sinks/sources (kg/(m³s)) and energy sinks/sources (J/(m³s)), respectively.
We consider observations of temperature only, while the parameters of interest are limited to (log) rock-type permeabilities. Other parameters that may be of interest include deep sources, relative permeabilities, and porosities, while other observable quantities include production history pressure and enthalpy. The relationship between the permeabilities and temperature, that is, the parameter-to-observable map, can be understood by examining the key terms in equations (25) and (26), following Cui et al. (2011). A more in-depth discussion is given in Pruess et al. (1999). First, the amounts of mass and energy per unit volume are given by

$$M^{m} = \phi \left(\rho_{l} S_{l} + \rho_{v} S_{v}\right),$$

$$M^{e} = \phi \left(\rho_{l} u_{l} S_{l} + \rho_{v} u_{v} S_{v}\right) + (1 - \phi)\, \rho_{r} c_{r} T,$$

respectively, where $\phi$ is porosity (dimensionless), $S_{l}$ denotes liquid saturation (dimensionless), $S_{v}$ represents vapor saturation (dimensionless), $\rho_{l}$ signifies the density of the liquid (kg/m³), $\rho_{v}$ is the vapor density (kg/m³), $\rho_{r}$ represents the density of the rock (kg/m³), $u_{l}$ denotes the internal energy of the liquid (J/kg), $u_{v}$ signifies the internal energy of the vapor (J/kg), $c_{r}$ is the specific heat of the rock (J/(kg K)), and $T$ denotes temperature (K). Next, the mass flux is given by the sum of the mass flux of liquid and the mass flux of vapor:

$$\mathbf{F}^{m} = \mathbf{F}_{l} + \mathbf{F}_{v}, \qquad \mathbf{F}_{l} = -\frac{k\, k_{rl}}{\nu_{l}} \left(\nabla P - \rho_{l} \mathbf{g}\right), \qquad \mathbf{F}_{v} = -\frac{k\, k_{rv}}{\nu_{v}} \left(\nabla P - \rho_{v} \mathbf{g}\right).$$

Here $k$ represents the permeability tensor (m²), $P$ is pressure (Pa), $\nu_{l}$ and $\nu_{v}$ are the kinematic viscosities of liquid (m²/s) and vapor (m²/s), respectively, $k_{rl}$ and $k_{rv}$ signify relative permeabilities (dimensionless), while $\mathbf{g}$ denotes gravitational acceleration (m/s²). Finally, the energy flux is given by

$$\mathbf{F}^{e} = h_{l} \mathbf{F}_{l} + h_{v} \mathbf{F}_{v} - K \nabla T,$$

where $h_{l}$ and $h_{v}$ are the specific enthalpies (J/kg) of liquid and vapor, respectively, and $K$ represents thermal conductivity (J/(K m s)). In our study, all parameters other than rock permeabilities were taken as known.
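As a simple illustration of how permeability enters the parameter-to-observable map, the phase mass fluxes and the energy flux described above can be evaluated directly for a one-dimensional vertical column. All numerical values below are invented for demonstration and are not taken from the case studies:

```python
# Illustrative 1-D (vertical) evaluation of the two-phase Darcy mass fluxes
# and the resulting energy flux; all parameter values are made up.
k = 1e-14                   # permeability (m^2), scalar isotropic here
k_rl, k_rv = 0.7, 0.3       # relative permeabilities (-)
nu_l, nu_v = 3e-7, 2e-5     # kinematic viscosities (m^2/s)
rho_l, rho_v = 900.0, 0.6   # liquid/vapor densities (kg/m^3)
dPdz = -9000.0              # vertical pressure gradient (Pa/m), z upward
g = -9.81                   # gravitational acceleration (m/s^2), downward
h_l, h_v = 1.0e6, 2.7e6     # specific enthalpies (J/kg)
K, dTdz = 2.5, -0.1         # thermal conductivity (W/(m K)), temp gradient (K/m)

# Mass flux of each phase: F_beta = -(k k_r,beta / nu_beta)(dP/dz - rho_beta g)
F_l = -(k * k_rl / nu_l) * (dPdz - rho_l * g)
F_v = -(k * k_rv / nu_v) * (dPdz - rho_v * g)
F_mass = F_l + F_v

# Energy flux: advection by each phase plus heat conduction.
F_energy = h_l * F_l + h_v * F_v - K * dTdz
```

With these (hypothetical) values the pressure gradient exceeds the hydrostatic and vapor-static gradients, so both phases flow upward, carrying heat with them; doubling the permeability doubles both advective fluxes, which is the basic mechanism through which temperature observations inform permeability.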

6.2 Model Setup and Simulation

We consider two scenarios as case studies—the first is based on a synthetic two-dimensional slice model, while the second is based on the Kerinci geothermal system, Sumatra, Indonesia. Each case study involves both a fine model and a coarse model, and thus, in total, we have four computational geothermal models in this work.

In all cases we solve the forward problem using the computer package AUTOUGH2 (Yeh et al., 2012), The University of Auckland's version of the TOUGH2 (Pruess et al., 1999) simulator, with the pure water equation of state module, EOS1. We consider only steady state conditions, though, as is standard, we calculate steady states via time marching to assist convergence to proper model solutions.

The parameters of interest in both case studies are rock permeabilities, each associated with a given rock type. There has been some work on allowing a distinct rock type for each cell in the computational model (Bjarkason et al., 2018, 2019; Cui et al., 2011; Cui, Fox, & O'Sullivan, 2019). However, the standard approach in geothermal modeling and inversion (Fullagar et al., 2007; O'Sullivan & O'Sullivan, 2016; Popineau et al., 2018; Witter & Melosh, 2018), and the approach taken here, is to base the simulation model on a conceptual model of the geological structure. The simulation hence respects the lithologic boundaries of these geological models. Mathematically, this is equivalent to regularization by discretization (see, e.g., Kaipio & Somersalo, 2005; Aster et al., 2018) and is a way of incorporating important prior information. The present approach can allow for arbitrary assumptions on the rock-type structure, though at the cost of higher dimensionality and/or increased ill-posedness. We aim to investigate the effects of including uncertainty in geological structure in future studies.

6.2.1 Case Study I: Slice Model

For this case study we consider a two-dimensional slice model, shown in Figure 1, based on that considered in Bjarkason et al. (2016) and Maclaren et al. (2016).

Figure 1. Rectangular slice model geometry showing the rock-type locations, independent of the computational grid discretization. Each rock type is represented by a different color. The location and magnitude of the deep source in the lower-left corner are assumed to be known.

The model geometry is a rectangular slice with physical dimensions of 1,600 m deep and 2,000 m wide. For our test problem we restricted the unknowns to a set of 12 parameters, two each for six rock-type regions, where these regions are assumed known in the present work. The location and intensity of the source are also assumed known. All six rock types are assumed to have the same porosity (10%), rock grain density (2,500 kg/m³), thermal conductivity (2.5 W/(m K)), and specific heat (1.0 kJ/(kg K)). The top boundary condition consists of a constant pressure of 1 atm and a constant temperature of 15 °C. The bottom boundary condition consists of a constant heat flux of 80 mW/m², except at the bottom-left corner region (see Figure 1), where fluid with an enthalpy of 1,200 kJ/kg is supplied at a fixed mass flux as a deep source input. The side boundaries are closed.

The (noisy) measurements consist of temperatures taken at 15 depths down each of 7 vertical wells; this gives a total of 105 measurement points; see Figure A1. The synthetic data are corrupted by additive independent identically distributed mean zero Gaussian noise, which has a standard deviation of 5 °C.

We used two different computational discretizations, described in section 6.3.

6.2.2 Case Study II: Kerinci Model

For this case study we consider a three-dimensional model of the Kerinci geothermal system, Sumatra, Indonesia, shown in Figure 2. This is based on a model developed by Prastika et al. (2016). We briefly recap the key model features here; for full details see Prastika et al. (2016).

Figure 2. The Kerinci model geometry, with a vertical section showing the rock-type locations (assumed known), based on the fine model but largely independent of the computational grid discretization. Different colors represent different rock types. The model also includes a layer of atmospheric blocks (not shown). The locations and intensities of the sources are assumed to be known.

The model geometry has physical dimensions of 16 km by 14 km (horizontal) by 5 km (depth). Our problem has a set of 30 parameters, three each for 10 rock-type regions, where these regions are assumed known in the present work. One of these "rock types" corresponds to an atmospheric layer, so we have 27 key parameters of interest to estimate. All nine nonatmospheric rock types are assumed to have a porosity of 10%, except for the rock labeled C0001 (representing pumice), which has a slightly higher porosity (12%). The remaining properties of the nonatmospheric rock types were uniform: all are assumed to have the same rock grain density (2,500 kg/m³), thermal conductivity (2.5 W/(m K)), and specific heat (1.0 kJ/(kg K)). The top boundary condition consists of a constant pressure of 1 bar and a constant temperature of 25 °C. Most of the bottom boundary consists of a constant heat flux of 80 mW/m², except for a small number of blocks specified as the locations of deep source input (see Prastika et al., 2016). The total flow rate of the deep source input is 100 kg/s, split into 70 kg/s of fluid with an enthalpy of 1,400 kJ/kg and 30 kg/s of fluid with an enthalpy of 1,100 kJ/kg.

Measurements consisted of a total of 17 temperature measurements taken down three wells. We assume that the data are corrupted by additive independent identically distributed mean zero Gaussian noise with a standard deviation of 10 °C.

We again used two different computational discretizations, described in section 6.3.

6.3 Approximation Error Computations

For each case study, we calculated the statistics of the approximation errors by using the AUTOUGH2 simulator. The same process was used in each case, though slightly different numbers of simulations were used for each case study. We outline the general process below while indicating any differences between case studies.

6.3.1 Calculation Steps

In each case study, to calculate the statistics of the approximation errors, we simulated both the fine model, $f$, and the coarse model, $g$, 1,000 times each using AUTOUGH2. These simulations were taken over the naive posterior, which was first generated by running MCMC using the coarse model without accounting for the approximation errors.

For the slice model scenario, the naive posterior was constructed from 150,000 samples generated by MCMC, while for the Kerinci scenario we generated 90,000 samples. The statistics of the approximation errors were then calculated, as described above, by running the coarse and fine models on 1,000 samples randomly selected from the full set of 150,000 (slice model) or 90,000 (Kerinci) naive posterior samples.

For the slice model scenario the fine model geometry consisted of a square grid of 81 × 100 = 8,100 blocks (including one layer of atmospheric blocks), and the coarse model consisted of a grid of 17 × 20 = 340 blocks (again including one layer of atmospheric blocks). These model grids are shown in Appendix A.

For the Kerinci model scenario the fine model geometry consisted of 5,396 blocks (including one layer of atmospheric blocks) and the coarse model consisted of 908 blocks (again including one layer of atmospheric blocks). These model grids are again shown in Appendix A.

In each case study we ensured consistency of measurement locations using functionality of PyTOUGH described in O'Sullivan et al. (2013), which allows the same observation wells to be defined independently of grid resolution.

6.4 MCMC Computations

For the slice model scenario (and both with and without incorporation of the approximation errors) 150,000 samples were computed (an ensemble of 300 walkers taking 500 samples each) after discarding an initial 30,000 burn-in samples.

For the Kerinci scenario (both with and without incorporation of the approximation errors) 90,000 samples were computed (6 ensembles of 300 walkers taking 50 samples each) after discarding a total of 30,000 burn-in samples (5,000 for each ensemble).

All computations were carried out on a standard desktop computer with an AMD Ryzen 5 1600 3.2-GHz 6-Core Processor.

6.5 Computational Requirements of Forward Model and MCMC Simulation

In the slice model scenario, the fine model took approximately 1–5 min per simulation, while the coarse model took less than half a second per simulation, typically about 0.45 s. Thus, generating 150,000 samples using naive MCMC to construct the posterior distribution with the fine model would take around 100–500 days, whereas generating the same number of samples to construct the approximation error informed posterior with the coarse model took just under 20 hr. Taking into account only these MCMC runs, in the worst case this represents a speedup of at least a factor of 100.

In the Kerinci scenario, the fine model took approximately 30 s per simulation for well-behaved cases, but potentially several hours for less well behaved models. The coarse model typically took about 1–10 s per simulation but could take several minutes for less well behaved cases. The run times for both of these cases were much more variable for this model than for the slice model. We generated 90,000 samples by running six chains in parallel (see below for more detail) and then combining these. Generating 90,000 samples to construct the posterior distribution using the fine model and naive MCMC would take at least a year, and possibly up to a decade, whereas using the same number of samples to construct the approximation error informed posterior using the coarse model and naive MCMC (again run in six parallel batches) took about 12 days.

More sophisticated parallelization (Laloy & Vrugt, 2012; Vrugt et al., 2009), or use of gradient information (Carpenter et al., 2017; Patil et al., 2010), in the MCMC sampling algorithms could of course considerably change these timing estimates for full MCMC. Here we restrict attention to a particularly simple black-box MCMC sampler that can be easily coupled to AUTOUGH2 simulations. In general, however, we would still expect significant practical speedups in a range of realistic scenarios, as the BAE approach is suited to problems in which approximate premarginalization can be carried out with many less samples than required for full MCMC sampling.

In addition to the above rough timing estimates, the approximation error approach further requires both a naive posterior and the model approximation error statistics to be calculated. In each case, approximately the same amount of time was required to run full MCMC for the naive case as for the approximation error informed case: the key cost for all MCMC calculations is running the coarse model, and only the statistics of the particular likelihood model differ. Thus, for the slice model, approximately 20 hr was required to run full MCMC for the naive case and another 20 hr for the approximation error informed case. For the Kerinci model, naive MCMC would generally be expected to take between 2 and 6 days; here we found it took about 2.5 days when sampling from the prior and running the naive model.

In contrast to the MCMC cases, simulations of both the coarse and the fine model are required to calculate the approximation error statistics. For the slice model scenario we generated 1,000 samples by running 200 runs of each model in parallel on five nodes; this also took just under 20 hr. Thus, the total time for inversions using (a naive version of) the approximation error approach is approximately 20 × 3 = 60 hr. The worst-case effective speedup factor compared to naive sampling is thus at least 30 in the present work, but typically more like 50–150.
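These rough timing figures can be checked with a short back-of-the-envelope calculation, using the slice-model run times reported above:

```python
# Back-of-the-envelope check of the reported slice-model speedups.
n_samples = 150_000
fine_minutes = (1, 5)       # reported fine-model run time per simulation
bae_total_hours = 60        # naive MCMC + error statistics + corrected MCMC

fine_hours = [m * n_samples / 60 for m in fine_minutes]     # 2,500-12,500 hr
speedups = [h / bae_total_hours for h in fine_hours]
# Worst case (1 min/run): 2,500 hr / 60 hr, roughly a factor of 40;
# best case (5 min/run): roughly a factor of 200.
```

This bounds the effective speedup between roughly 40 and 200 for the slice model, consistent with the "at least 30, typically 50–150" range quoted above.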

For the Kerinci scenario we again used 200 runs of each model in parallel on five nodes. This took approximately 5.5 days. Thus, the total time for inversions using the approximation error approach was around 2.5 + 5.5 + 6 = 14 days (2 weeks). This again gives a speedup (compared to, e.g., 1–10 years) of at least about 30.

Natural ways to further increase the speedup of the approximation error calculations include, for example, running only an approximate, optimization-based sampler to generate the initial naive posterior (from which only 1,000 samples will be used). In our case, however, we simply ran full MCMC separately for both the naive and the approximation error informed cases. This enabled us to give a relatively fair comparison of the results from these two models. Furthermore, the initially calculated naive posterior, or at least its second-order statistics, could be used either to initialize the second (main) MCMC run of our algorithm or as a proposal distribution.

6.6 Availability

Our code was written in Python 2.7 using open source Python packages. It is available at GitHub (https://github.com/omaclaren/hierarchical-bae-manuscript).

An archived version of this code is available at Zenodo (http://doi.org/10.5281/zenodo.3509966).

Access to the AUTOUGH2/TOUGH2 (Pruess et al., 1999; Yeh et al., 2012) simulator is also required; we plan to adapt our code to use the new open-source Waiwera simulator (Croucher et al., 2018) when it is officially released.

The key functionality is implemented in a small library of object-oriented classes implementing the various components of the hierarchical framework.

7 Results and Discussion

Here we compare a series of inversion results for both the slice model and the Kerinci model scenarios. We focus on the results that are obtained, for each scenario, when using a coarse model without accounting for approximation errors and those obtained when using a coarse model when the approximation errors are accounted for. We consider both data space (posterior predictive) and parameter space (parameter posterior) distributions.

Particular emphasis is placed on (a) the feasibility of the posterior uncertainty estimates in parameter space, that is, the question of whether the posterior uncertainty is consistent with, in the sense of supporting, the true permeability values, and (b) the role of predictive checks with and without incorporation of the approximation errors.

7.1 Slice Model Scenario

Here we consider results from the slice model scenario.

7.1.1 Posterior Predictive Checks

In Figure 3 we show posterior predictive checks constructed by running the model on a subset of posterior samples obtained from MCMC. Realizations of the process model without measurement error are plotted in blue, while the data obtained from running the fine model and adding measurement error are shown in black. Figure 3a shows the posterior predictive check under the coarse model while neglecting the approximation errors.

Figure 3. Posterior predictive checks for (a) the naive model without approximation error correction, (b) with only the model approximation error correlations included, and (c) with the approximation errors included, that is, both the model error correlations and offset (bias) terms.

As can be seen in the figure, the coarse model fits the data well and the uncertainties are small. Thus, this check does not flag any potential issue with naively using the coarse model. On the other hand, (b) and (c) show the predictive checks resulting from inference under the approximation error corrected model. In particular, (b) shows the results when only the covariance of the approximation errors is accounted for, while (c) shows the results when both the approximation error covariance and the offset (bias) term are included. Comparison of (b) and (c) shows that both the error correlations and the bias term are important for obtaining a properly fitting model. More importantly, the difference in variation between (a) and (c) indicates that we are potentially underestimating the uncertainties involved in naively using the coarse model for inversion. Intuitively, the low variance of the naive posterior is counterbalanced by the introduction of additional bias into the parameter estimates, as illustrated in the next subsection.

An implication of these results is that, in general, posterior predictive checks against the original data do not appear to indicate issues arising from inversion under a reduced-order model. This is perhaps to be expected given the ill-posed nature of inverse problems; that is, pure within-sample data-fit checks are not sufficient to determine whether a model is appropriate. One potential fix is to carry out checks either on held-out data or, as in our case, against a more expensive/accurate model, which effectively plays the role of held-out data.

7.1.2 Posterior Parameter Distributions

Here we consider the (marginal) parameter space posterior distributions, both for the naive and the approximation error informed models. Figure 4 shows the marginal posteriors of the permeability for each rock type and each direction and both with and without incorporation of the approximation errors.

Figure 4. Marginal posteriors for parameters for which the naive and approximation error informed cases largely agree. The true parameter values are indicated using a dashed black line.

The first set of plots, in Figure 4, shows the parameters for which fairly consistent results were obtained under both the naive and approximation error informed models. The second and third sets of plots, shown in Figure 5 as two sets of plots labeled (a) and (b) for easier visual comparison, show cases where the results from the naive and the approximation error informed models tend to conflict.

Figure 5. Marginal posteriors for parameters for which the naive and approximation error informed cases tend to conflict. The true parameter values are indicated using a dashed black line. These plots are divided into two groups, labeled (a) and (b), purely to aid visual comparison. Within each grouping, the posterior for each parameter is shown twice: above, obtained without approximation error, and below, incorporating approximation error.

As can be seen in Figure 5, naive inversion under the coarse model often results in essentially infeasible parameter estimates, that is, posteriors for which the truth is assigned only a low posterior probability density. On the other hand, the approximation error corrected case always assigns a high posterior probability density to the true parameters (though in some cases this is slightly lower than the density assigned under the naive case). In reality, of course, neither model will be correct, but it is hoped that the fine-scale model is a better reflection of the truth.

Some of the parameters appear to be effectively nonidentifiable, as indicated by the lack of updating when comparing the prior to the posterior distributions (see Evans, 2015, for a systematic review and discussion of measuring statistical evidence in a Bayesian setting). This lack of identifiability could also be quantified using, for example, the Kullback-Leibler divergence; however, we prefer to present comparisons graphically, following the general Bayesian data analysis philosophy of Gelman et al. (2013). In particular, the horizontal permeabilities of the cap rock and of the outflow region appear to be largely uninformed by the data. Physically, this can be explained by the fact that there is very little horizontal fluid flow in the cap rock and essentially all fluid flow in the outflow region is in the vertical direction. On the other hand, the remaining parameters appear to be reasonably well identifiable, and several appear to be strongly identified. Under the naive model, however, inversion for the strongly identifiable parameters gives posteriors that appear very well informed but in fact provide effectively infeasible estimates. This exposes another trade-off: parameters that are strongly informed by the data under one model will tend to be more strongly biased toward different values when estimated under a different model.

7.2 Kerinci Model Scenario

Here we consider results from the Kerinci model scenario.

7.2.1 Posterior Predictive Checks

In Figure 6 we show posterior predictive checks constructed by running the model on a subset of posterior samples obtained from MCMC. Realizations of the process model without measurement error are plotted in blue, while the data obtained from running the fine model and adding measurement error are shown in black. Figure 6a shows the posterior predictive check under the coarse model without incorporation of the approximation errors, while Figure 6b shows the results incorporating the approximation errors.
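A posterior predictive check of the kind shown in Figure 6 can be sketched as follows; the `toy_model` and the pretend MCMC samples below are hypothetical stand-ins for the geothermal simulator and the actual chains.

```python
import numpy as np

def posterior_predictive(model, samples, n_draws, rng):
    """Run the (coarse) process model on a thinned subset of MCMC samples.

    `model` maps a parameter vector to the noiseless model output; the
    returned array stacks one realization per retained sample.
    """
    idx = rng.choice(len(samples), size=n_draws, replace=False)
    return np.array([model(samples[i]) for i in idx])

# Toy stand-in for the simulator: a one-parameter exponential decay.
t = np.linspace(0.0, 1.0, 20)
def toy_model(theta):
    return np.exp(-theta[0] * t)

rng = np.random.default_rng(1)
mcmc_samples = rng.normal(2.0, 0.3, size=(5000, 1))  # pretend MCMC output
realizations = posterior_predictive(toy_model, mcmc_samples, 200, rng)
print(realizations.shape)  # (200, 20)
```

Plotting the realizations together with the observed data then gives the visual check: data falling well outside the band of realizations flags potential underfitting.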

Posterior predictive checks for (a) the naive model without using the approximation errors and (b) while using the approximation errors (both the model error correlations and offset/bias terms).

As can be seen, in this more realistic model, and under more extreme model simplification (the discretization is significantly reduced and simplified in the coarse model), the approximation errors can be quite large. The difference in variation between Figures 6a and 6b indicates that the uncertainties are likely underestimated when naively using the coarse model for inversion. Although the coarse model predictive check provides a tighter fit around the measured data, it also assigns much less probability density to at least one data point; in this sense it provides a worse fit to the data and flags potential underfitting of the coarse model.

7.2.2 Posterior Parameter Distributions

Here we consider the (marginal) parameter space posterior distributions, both for the naive and the approximation error informed models; see Figures 7 and 8. For brevity we include a representative selection here; the remaining distributions as well as full corner plots (Foreman-Mackey, 2016) are given in the supporting information. The same basic patterns observed in the plots shown here can also be seen in the plots in the supporting information.

Marginal posteriors for the rock permeabilities in the Kerinci model, in cases where the results largely agree between the naive case and the approximation error informed case.
Marginal posteriors for the rock permeabilities in the Kerinci model, in cases where the results tend to conflict between the naive case and the approximation error informed case.

The first set of plots, in Figure 7, shows the parameters for which fairly consistent results were reached by both the naive and approximation error informed models. The second set, in Figure 8, shows cases where the results tend to conflict between the two. Here the true parameters are unknown and hence not shown.

7.3 Additional Comments

As we have noted above, standard MCMC sampling is much more computationally feasible for these geothermal inverse problems when using coarser models as opposed to finer, more accurate models. Importantly, however, we see that just naively using a coarse model without accounting for approximation errors tends to give overconfident and biased posteriors, for which the known true parameters can lie outside of the bulk of the support. On the other hand, taking into account the approximation errors leads to known true parameters lying inside the bulk of the support in all cases considered here. Both methods require effectively the same amount of computation time, though the BAE approach requires some additional initial computation to construct the model error statistics. This additional computational effort is the price paid to avoid misleading estimates and is still significantly less than attempting MCMC using the fine model.

In this paper we have only considered the use of relatively naive MCMC sampling to estimate the posterior density for the permeabilities, based on an approximation error informed coarse model. More sophisticated MCMC algorithms, for example, those utilizing parallelization (Laloy & Vrugt, 2012; Vrugt et al., 2009) and/or derivative information (Carpenter et al., 2017; Patil et al., 2010) would be expected to speed up the sampling significantly. In some settings, however, these more sophisticated forms of MCMC may still be computationally infeasible even using only the coarse model (with or without approximation errors included). In this case, the posterior approximation errors can still be constructed without MCMC, as long as some alternative method is available for drawing the (smaller) set of required samples from the naive posterior. For example, here we only required 1,000 samples from the naive posterior, compared to the 150,000 or 90,000 used for full MCMC runs. This would then enable the use of a coarse model which accounts for approximation errors alongside alternative sampling and/or optimization-based approaches.
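As a sketch of the idea in the last paragraph, the posterior-informed approximation error statistics can be estimated from a modest number of paired fine/coarse runs over naive-posterior samples; the toy models and sample counts below are illustrative stand-ins for the actual simulators and chains.

```python
import numpy as np

def approx_error_stats(fine_model, coarse_model, naive_posterior_samples):
    """Posterior-informed approximation-error statistics.

    For each parameter sample drawn from the naive (coarse-model) posterior,
    evaluate the fine and coarse models and record the discrepancy; the
    sample mean acts as an offset/bias correction and the sample covariance
    augments the measurement-noise covariance in the corrected likelihood.
    """
    errors = np.array([fine_model(th) - coarse_model(th)
                       for th in naive_posterior_samples])
    return errors.mean(axis=0), np.cov(errors, rowvar=False)

# Toy stand-in: a "fine" quadratic vs a "coarse" linear response at 5 points.
x = np.linspace(0.0, 1.0, 5)
fine = lambda th: th[0] * x + 0.5 * x**2
coarse = lambda th: th[0] * x

rng = np.random.default_rng(2)
samples = rng.normal(1.0, 0.1, size=(1000, 1))  # pretend naive-posterior draws
mu_eps, cov_eps = approx_error_stats(fine, coarse, samples)
# mu_eps -> 0.5 * x**2 exactly (here the error is parameter-independent)
```

Only the comparatively cheap coarse model then appears inside the MCMC loop; the fine model is needed only for this one-off batch of paired runs.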

8 Conclusions

We have demonstrated how to carry out simple yet computationally feasible parameter estimation and uncertainty quantification for geothermal simulation models by using a coarser, or cheaper, model in place of a finer, more expensive one. Our approach was to construct a posterior-informed approximation to the Bayesian model approximation error and to incorporate this into a hierarchical Bayesian framework. The hierarchical Bayesian perspective provides a flexible and intuitive setting for specifying assumptions on different model components and their combinations. In this view, approximations and modeling assumptions are incorporated directly into the framework by replacing joint distributions with factorizations in terms of simpler conditional and/or marginal distributions.

Our approach requires two simple initial computational steps in order to correct for the bias and/or overconfidence that would normally be introduced by directly using the coarse model in place of the finer model. These two steps then enable standard, out-of-the-box MCMC to be used to sample the parameter posterior using the coarse model. We demonstrated our approach can achieve significant computational speedups on both synthetic and real world geothermal test problems.

Our approach consists of three relatively simple steps overall and should be more accessible to general practitioners than having to manually implement more complex sampling schemes. Furthermore, the methods developed here should be generally applicable to related inverse problems such as, for example, those appearing in petroleum reservoir engineering and groundwater management.


The authors appreciate the contribution of the NZ Ministry of Business, Innovation and Employment for funding parts of this work through Grant C05X1306 (Geothermal Supermodels). The authors would also like to thank Jari Kaipio for helpful discussions about Bayesian approximation error methods, Joris Popineau for visualizations of the Kerinci model, Ryan Tonkin for useful discussions on geothermal modeling, and the three reviewers for feedback that significantly improved this manuscript. Our code is available from GitHub (https://github.com/omaclaren/hierarchical-bae-manuscript) and is archived at Zenodo (http://doi.org/10.5281/zenodo.3509966).

Appendix A: Mapping Between Fine and Coarse Grids

To facilitate computation of the process model error, and following Kaipio and Kolehmainen (2013), in this study we have been implicitly assuming that the difference between the fine and coarse models can be approximated as

Here urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0177 is a slight abuse of notation: it still represents the fine model evaluated on a fine parameter grid, but with the parameters fixed to be homogeneous within each rock type, matching the values of the corresponding rock types on the coarser grid. Thus, the parameter vectors have the same effective dimension (and values) as on the coarse grid and are in one-to-one correspondence. This is made clearer by comparing Figure A1 below with Figure 1 introduced earlier: each mesh in Figure A1 represents a different discretization of the same underlying parameter grid given in Figure 1. This assumption means we can compute the approximation error by sampling the coarse parameters directly rather than the (higher-dimensional) fine parameters. Implicitly, however, this neglects some of the approximation error that would be induced by sampling over all fine parameter sets compatible with a given coarse parameter set. The assumption can be checked, or removed, to the extent that computational resources allow computing the error over the fine grid (Kaipio & Kolehmainen, 2013). Either way, the coarse grid parameters are the ultimate targets of inference, and by using the more conservative “enhanced” (or “composite”) error model based on the marginal error distribution, we can hope to account for some of this additional uncertainty indirectly.
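The rock-type mapping described above can be sketched as follows; the rock-type indices, block counts, and permeability values below are hypothetical.

```python
import numpy as np

def coarse_to_fine(coarse_values, rock_type_of_fine_block):
    """Map one permeability value per rock type onto every fine-grid block.

    `rock_type_of_fine_block[i]` gives the rock-type index of fine block i,
    so the fine parameter vector is piecewise constant by rock type and in
    one-to-one correspondence with the coarse parameter vector.
    """
    return coarse_values[rock_type_of_fine_block]

# Toy setup: 3 rock types spread over 10 fine blocks (indices assumed).
rock_types = np.array([0, 0, 1, 1, 1, 2, 2, 0, 1, 2])
coarse_perm = np.array([-15.0, -14.0, -13.5])  # e.g. log10 permeabilities
fine_perm = coarse_to_fine(coarse_perm, rock_types)
print(fine_perm)  # one entry per fine block, constant within each rock type
```

Sampling the coarse parameters and pushing them through such a map is what allows the fine model to be evaluated on coarse-dimensional parameter draws when building the error statistics.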

Computational grids used to simulate the geothermal slice system. The fine model geometry (a) consisted of a square grid of 81 × 100 = 8,100 blocks (including one layer of atmospheric blocks), and the coarse model (b) consisted of a grid of 17 × 20 = 340 blocks (again including one layer of atmospheric blocks). Observation wells are shown as blue vertical lines.

The fine and coarse Kerinci models were related in the same manner, with only the mesh discretization varying. A top view of the two meshes is shown in Figure A2.

Computational grids used to simulate the Kerinci geothermal system, showing only the top-down view for illustrative purposes. The fine model geometry (a) consisted of a 3-D system of 5,396 blocks (including one layer of atmospheric blocks), and the coarse model (b) consisted of a 3-D system of 908 blocks (again including one layer of atmospheric blocks).

Appendix B: Geometric View of BAE

In Figure B1 below we give a geometric picture of both the standard prior-based and our posterior-based composite (enhanced) error model approach. In both cases we essentially aim to conservatively cover the deterministic functional relationship urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0180, or the associated degenerate joint distribution urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0181, by a probability distribution based on marginal distributions. In the posterior case, however, we restrict attention to estimating the error by sampling over the support of the naive posterior. As can be seen in the figure, the accuracy of this procedure depends on, for example, how well the naive posterior approximates the true posterior. Alternatively, if the error is approximately independent of the parameter, hence giving a horizontal line for urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0182, then both the prior and posterior error distributions would give the same delta distribution for the error, regardless of how well the naive posterior approximates the true posterior. Thus, intuitively, the procedure would be expected to be most reasonable when (a) the naive posterior approximates the true posterior reasonably well and/or (b) the model error does not depend strongly on the parameter. This latter condition is already a condition for the usual enhanced/composite error model approach to provide a reasonable approximation, and so switching to the posterior composite error model is at least consistent with this assumption.
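The intuition can also be illustrated numerically: for a toy parameter-dependent error, the marginal error distribution induced by sampling a concentrated naive posterior is tighter than the one induced by sampling a broad prior. All numbers below are illustrative choices of ours, not values from the paper.

```python
import numpy as np

# Toy model error that depends on the parameter: eps(theta) = 0.2 * theta**2.
eps = lambda theta: 0.2 * theta**2

rng = np.random.default_rng(3)
prior_draws = rng.normal(0.0, 2.0, size=20000)       # broad prior
naive_post_draws = rng.normal(1.0, 0.3, size=20000)  # concentrated naive posterior

# Marginal error distributions under the two sampling choices: the
# posterior-based marginal is tighter and centred where the inversion
# actually operates.
prior_err = eps(prior_draws)
post_err = eps(naive_post_draws)
print(prior_err.mean(), prior_err.std())
print(post_err.mean(), post_err.std())
```

If `eps` were constant in `theta` (a horizontal line), both marginals would collapse to the same point mass, matching the limiting case discussed above.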

Geometric interpretation of the enhanced/composite Bayesian model approximation error approach in both (a) the usual prior-based case and (b) our posterior-based approach. In both cases we essentially aim to conservatively cover the deterministic functional relationship urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0183, or the associated degenerate joint distribution urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0184, by a probability distribution based on marginal distributions. In the posterior case, however, we restrict attention to estimating the error by sampling over the support of the naive posterior.

Appendix C: An Illustrative Example

Here we consider a simple curve-fitting problem to provide some further intuition for our method and an example of how the method works with correlated measurement errors. We take the accurate model between parameters and observations to be given by urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0185, for urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0186; that is, an urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0187th-order polynomial measured at points urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0188. We take the coarse model to be given by urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0189, for some urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0190.

We assume a Gaussian prior and zero-mean Gaussian additive noise; that is, urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0191 and urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0192. The fact that both forward models are linear (in urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0193) and that both the prior and the noise distribution are Gaussian means that the resulting posterior is also Gaussian; furthermore, no (MCMC) sampling is required. Linearity of both models allows us to write

In this case we have urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0195, where urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0196 is the diagonal orthogonal projection matrix with urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0197 for urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0198 and zero otherwise. This results in the standard BAE composite posterior (in this simple case it is in fact possible to use the standard BAE approach, as model failures are not an issue) coinciding with the posterior computed using our proposed approach.

We compare (a) the naive approach, that is, ignoring the model approximation errors altogether; (b) the posterior densities computed using our proposed approach; and (c) the true posterior, calculated using the accurate forward model, urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0199.

In all cases the same prior is used, while the associated likelihoods are modified. The naive posterior, the posterior based on the proposed approach, and the true posterior are given by
respectively. The associated MAP estimates are given by
and the posterior covariance matrices by
where urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0203, urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0204, urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0205, with urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0206, and urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0207.
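These closed-form quantities can be sketched numerically. The helper and the toy setup below (a quadratic "accurate" model, a linear "coarse" model obtained by projection, an assumed grid and noise level) are our illustrative choices, not the paper's exact values; following the remark above, the error statistics are built from prior draws, since the standard and posterior-based composite error models coincide in this linear-Gaussian setting.

```python
import numpy as np

def gaussian_posterior(J, d, noise_cov, prior_mean, prior_cov, offset=None):
    """MAP estimate and covariance for d = J @ theta + offset + e,
    with e ~ N(0, noise_cov) and prior theta ~ N(prior_mean, prior_cov)."""
    if offset is None:
        offset = np.zeros(len(d))
    noise_prec = np.linalg.inv(noise_cov)
    prior_prec = np.linalg.inv(prior_cov)
    post_cov = np.linalg.inv(J.T @ noise_prec @ J + prior_prec)
    map_est = post_cov @ (J.T @ noise_prec @ (d - offset) + prior_prec @ prior_mean)
    return map_est, post_cov

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 9)     # assumed measurement grid
A = np.column_stack([x, x**2])   # accurate model: a1*x + a2*x**2
P = np.diag([1.0, 0.0])          # projection: coarse model drops the x**2 term
J = A @ P

theta_true = np.array([1.0, 2.0])
noise_cov = 0.05**2 * np.eye(len(x))
d = A @ theta_true + rng.multivariate_normal(np.zeros(len(x)), noise_cov)

prior_mean, prior_cov = np.zeros(2), np.eye(2)

# (a) naive: coarse model, approximation error ignored.
map_naive, cov_naive = gaussian_posterior(J, d, noise_cov, prior_mean, prior_cov)

# (b) corrected: augment the likelihood with the mean and covariance of the
# model error eps = (A - J) @ theta, estimated from prior draws.
draws = rng.multivariate_normal(prior_mean, prior_cov, size=2000)
errs = draws @ (A - J).T
map_bae, cov_bae = gaussian_posterior(
    J, d, noise_cov + np.cov(errs, rowvar=False),
    prior_mean, prior_cov, offset=errs.mean(axis=0))

# (c) true: accurate model.
map_true, cov_true = gaussian_posterior(A, d, noise_cov, prior_mean, prior_cov)
```

In this toy setup the naive MAP estimate of the first coefficient is strongly biased (it absorbs the neglected quadratic term), while the corrected posterior both widens appropriately and recenters near the truth.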

In line with the geothermal examples, we take the prior covariance matrix for urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0208 to be diagonal, that is, urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0209, with urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0210 denoting the urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0211 identity matrix. This choice of prior, along with the fact that the simplified model is of the form urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0212, in fact results in the posterior of our proposed method being identical to the true posterior for the first urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0213 parameters, that is, urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0214.

To demonstrate how the method works with correlated measurement noise, we take the additive noise to be of the (multilevel) form
where urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0216 is a block diagonal matrix of the form
with urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0218 and urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0219 used to denote the square matrices of all ones and all zeros, respectively. Several draws of correlated errors are shown in Figure C1.
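A block-diagonal covariance of this ones/zeros form can be assembled as follows; the block sizes and standard deviations below are illustrative choices, not the exact values used in the example.

```python
import numpy as np

def block_ones_cov(sigma_ind, sigma_corr, block_sizes):
    """Covariance of e = e_ind + e_corr, where e_ind is i.i.d. noise and
    e_corr is constant within each block (fully correlated inside a block).

    Built as a scaled identity plus a block-diagonal matrix of all-ones
    blocks, mirroring the multilevel noise model described above."""
    n = sum(block_sizes)
    C = sigma_ind**2 * np.eye(n)
    start = 0
    for b in block_sizes:
        C[start:start + b, start:start + b] += sigma_corr**2 * np.ones((b, b))
        start += b
    return C

C = block_ones_cov(0.1, 0.3, [3, 3, 3])  # 9 measurements, three blocks of 3
rng = np.random.default_rng(5)
draws = rng.multivariate_normal(np.zeros(9), C, size=5)  # correlated error draws
print(C[0, 1], C[0, 3])  # within-block vs cross-block covariance
```

Errors within a block share a common offset, while errors in different blocks are independent, which is what produces the stepped draws visible in Figure C1.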
Five draws from the correlated noise prior distribution (left) and the data used for this synthetic example (red crosses) along with the true underlying model (black dashed line).

For this example we specify the number of measurements as urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0220, with measurement points equally spaced between urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0221 and urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0222. For ease of visualization we take the accurate model to be quadratic and the coarse model linear; that is, urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0223 and urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0224. The prior mean is urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0225, and we take urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0226. Finally, the correlated noise distribution is set by taking urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0227 and urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0228, and setting the block diagonal matrix urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0229 to have three urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0230 diagonal blocks; this corresponds to a noise level of 30% of the maximum of the noiseless synthetic measurements, that is, urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0231; see Figure C1. Also shown in Figure C1 are the data for this example.

The resulting marginal and joint posteriors for urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0232 and urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0233 using each of the methods are shown in Figure C2, while the posterior predictive plots are shown in Figure C3. It is clear that using the naive posterior (i.e., neglecting the approximation errors) can lead to an infeasible posterior, in the sense that the true values for urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0234 have almost vanishing posterior probability. On the other hand, in this example, using the proposed posterior composite error model leads to a feasible posterior with a more representative MAP estimate.

Joint distributions (top) and marginal distributions for urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0235 (bottom left) and urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0236 (bottom right). The prior is shown using either gray scale or a solid black line, the true posterior is shown in green, the naive posterior in red, and the posterior found using the proposed approach in blue; the true values are identified with either a black cross or a dashed black line. Note that in the marginal plot for urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0237 the true posterior is identical to the marginal posterior found using the proposed method; furthermore, in the marginal plot for urn:x-wiley:wrcr:media:wrcr24396:wrcr24396-math-0238 both posterior marginals using the simpler model are equal to the prior marginal.
Posterior predictive plots using the true posterior (top), the naive posterior (bottom left), and the posterior composite error model (bottom right). In all plots the data are indicated with red crosses, the true underlying model is shown with the red solid line, the predictive mean with the blue dashed line, and the uncertainty in the prediction intervals using gray scale, with higher probability density indicated by darker shading.