# Incorporating Posterior-Informed Approximation Errors Into a Hierarchical Framework to Facilitate Out-of-the-Box MCMC Sampling for Geothermal Inverse Problems and Uncertainty Quantification

## Abstract

We consider geothermal inverse problems and uncertainty quantification from a Bayesian perspective. Our main goal is to make standard, “out-of-the-box” Markov chain Monte Carlo (MCMC) sampling more feasible for complex simulation models by using suitable approximations. To do this, we first show how to pose both the inverse and prediction problems in a hierarchical Bayesian framework. We then show how to incorporate so-called posterior-informed model approximation error into this hierarchical framework, using a modified form of the Bayesian approximation error approach. This enables the use of a “coarse,” approximate model in place of a finer, more expensive model, while accounting for the additional uncertainty and potential bias that this can introduce. Our method requires only simple probability modeling and a relatively small number of fine model simulations, and it modifies only the target posterior—any standard MCMC sampling algorithm can be used to sample the new posterior. These corrections can also be used in methods that are not based on MCMC sampling. We show that our approach can achieve significant computational speedups on two geothermal test problems. We also demonstrate the dangers of naively using coarse, approximate models in place of finer models without accounting for the induced approximation errors. The naive approach tends to give overly confident and biased posteriors; incorporating Bayesian approximation error into our hierarchical framework corrects for this while maintaining computational efficiency and ease of use.

## Key Points

- We consider geothermal inverse problems and uncertainty quantification from a Bayesian perspective
- We present a simple method for incorporating posterior-informed approximation errors into a hierarchical Bayesian framework
- Our method makes standard out-of-the-box MCMC sampling feasible for more complex models while correcting for bias and overconfidence

## 1 Introduction

Computational modeling plays an important role in geothermal reservoir engineering and resource management. A significant task for decision making and prediction in geothermal resource management is so-called *inverse modeling*, also known as *model calibration* within the geothermal community, and as solving *inverse problems* in applied mathematics. Calibration consists of determining parameters compatible with measured data. This is in contrast to so-called *forward modeling* in which a simulation is based on known model parameters. Comprehensive reviews of geothermal modeling, including both forward modeling and model calibration, are given by O'Sullivan et al. (2001) and O'Sullivan and O'Sullivan (2016).

The primary parameters of interest in geothermal inverse problems include the anisotropic permeability of the subsurface and the location and strength of so-called deep upflows/sources. Knowledge of the values of these parameters allows forecasts to be made of, for example, the temperature and pressure down drilled (or to-be-drilled) wells. On the other hand, the available (i.e., directly measurable) quantities are instead typically temperature, pressure, and enthalpy at observation wells (O'Sullivan et al., 2001; O'Sullivan & O'Sullivan, 2016). A typical geothermal inverse problem for a natural-state (i.e., steady state, preexploitation) model then consists of, for example, estimating formation permeabilities based on temperature and/or pressure measurements at observation wells.

The predominant method used to solve geothermal inverse problems is still manual calibration (Burnell et al., 2012; Mannington et al., 2004; O'Sullivan & O'Sullivan, 2016; O'Sullivan et al., 2009), although it is well recognized that this is far from an optimal strategy. To address this situation, there has been a concerted effort to automate the calibration process. For example, software packages such as iTOUGH2 (Finsterle, 2000) and PEST (Doherty, 2015) have been developed, and used, for geothermal model calibration. These packages are primarily based on framing the inverse problem as one of finding the minimum of a regularized cost, or objective, function; though essentially deterministic, approximate confidence (or credibility) intervals for model parameters can be constructed from local cost function derivative information (Aster et al., 2018). Even for optimization-based approaches to geothermal inverse problems, computations can be expensive and improvements are required to speed up the process. We recently proposed accelerating optimization-based solution methods using adjoint methods and randomized linear algebra (Bjarkason, 2019; Bjarkason et al., 2018, 2019).

Bayesian inference is an alternative to optimization-based approaches and is instead an inherently probabilistic framework for inverse problems (Kaipio & Somersalo, 2005; Stuart, 2010; Tarantola, 2004). This naturally allows for incorporation and quantification of uncertainty in the estimated parameters; when posed in the Bayesian setting, the solution to the inverse problem is an entire probability density over the parameters. Here we adopt a *hierarchical* Bayesian approach in particular, where we use “hierarchical Bayes” in the sense of Berliner (1996, 2003, 2012). This approach is discussed in detail in section 3. The key to the method proposed here is incorporating approximation errors between an accurate and a coarse model as a component in our hierarchical framework, by adapting the Bayesian approximation error (BAE) approach (Kaipio & Somersalo, 2005; Kaipio & Kolehmainen, 2013). This allows us to speed up computation of parameter estimates while avoiding overconfidence in biased estimates by accounting for the approximation errors induced when coarsened models are used. The trade-off for improved computation time is modified posteriors with inflated variance relative to the ideal target posterior.

There is only a relatively small amount of literature taking a fully Bayesian approach to geothermal inverse problems (e.g., Cui et al., 2011; Cui, Fox, O'Sullivan, & Nicholls, 2019; Cui, Fox, & O'Sullivan, 2019; Maclaren et al., 2016), where by “fully Bayesian” we mean sampling (or otherwise computing) a full probability distribution rather than calculating a single point estimate and making local approximations to the posterior covariance matrix. We previously presented a hierarchical Bayesian approach to frame the inverse problem and used a generic sampling method to solve the resulting problem (Maclaren et al., 2016). On the other hand, Cui et al. (2011) and Cui, Fox, and O'Sullivan (2019) developed a more sophisticated adaptive sampling scheme based on using a coarsened model and a fine model. The present work is based on extending the hierarchical Bayesian framework of Maclaren et al. (2016) to explicitly use approximate models while being independent of which sampling scheme is used and straightforward to implement.

## 2 Background: The Bayesian Approach to Inverse Problems

The Bayesian framework for inverse problems allows for systematic incorporation and subsequent quantification of parameter uncertainties (Kaipio & Somersalo, 2005; Stuart, 2010), which can then be propagated through to model predictions. In this framework, the solution to the inverse problem is an entire probability distribution, that is, the *posterior probability distribution*, or simply *the posterior*. Both epistemic (knowledge-based) and aleatoric (actually random) uncertainties are represented using the same probabilistic formalism in Bayesian inference.

By Bayes' theorem, the posterior is proportional to the product of two terms, $p(\theta \mid d) \propto p(d \mid \theta)\, p(\theta)$, where $p(d \mid \theta)$ is the *likelihood* and $p(\theta)$ is the *prior*.
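For a Gaussian likelihood and a Gaussian prior, the resulting unnormalized log-posterior is simple to write down. The following toy one-parameter sketch (with placeholder numbers and an identity "forward model", not either case study in this paper) illustrates the structure that any MCMC sampler needs:

```python
import numpy as np

def log_posterior(theta, data, sigma_noise=5.0, prior_mean=0.0, prior_sd=1.0):
    """Unnormalized log-posterior: log-likelihood plus log-prior.

    Toy forward model g(theta) = theta (identity); in practice g would be
    an expensive reservoir simulation.
    """
    g = theta  # placeholder for the forward model / parameter-to-observable map
    log_like = -0.5 * np.sum((data - g) ** 2) / sigma_noise ** 2
    log_prior = -0.5 * ((theta - prior_mean) / prior_sd) ** 2
    return log_like + log_prior
```

In a real application only `g` changes; the probabilistic structure stays the same.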

A drawback of the fully Bayesian approach is the intensive computational cost that is usually required to apply Bayes' theorem, especially in the case of complex models such as in the geothermal setting (see, e.g., Cui et al., 2015). The dominant cost is repeated evaluation of the forward model and thus coarsened or surrogate models are often used in place of the most accurate forward model (see, e.g., Asher et al., 2015). Furthermore, the use of coarsened or surrogate models can help alleviate numerical instabilities (Doherty & Christensen, 2011). However, replacement of an accurate model with a surrogate invariably results in so-called *approximation errors*, which, if not accounted for, can lead to parameters and their associated uncertainty being incorrectly estimated (see, e.g., Doherty & Welter, 2010; Kaipio & Somersalo, 2007; Kennedy & O'Hagan, 2000). Next we give a brief overview of the main approaches in the literature for accounting for these errors. We then discuss how we incorporate these ideas into a hierarchical framework.

### 2.1 Approximation Errors and Model Discrepancies

In the Bayesian viewpoint, approximation errors can be treated as a further source of uncertainty. There are two standard approaches for dealing with these errors: that based on the work of Kennedy and O'Hagan (2000) (referred to as KOH hereafter) and the BAE approach proposed by Kaipio and Somersalo (2005). The underlying principles of both approaches are similar, though with some implementation and philosophical differences. In particular, the KOH method was explicitly developed both to account for the difference between “reality” and a given simulation model and to allow for efficient emulation of computationally expensive models at arbitrary values (Higdon et al., 2004, 2008; Kennedy & O'Hagan, 2000). The typical KOH method is based on infinite-dimensional Gaussian process models: one to model the difference between reality and the simulation model and one to represent the output of the simulation model at new input values. Usually, only one physically based model is used (Higdon et al., 2004, 2008).

The BAE approach, in contrast, is based on two physically based simulation models: one which represents the “best,” but typically very expensive model, and one representing a coarser model, which nevertheless preserves the key physics of the problem. Furthermore, the approximation errors between the two physically based models are represented by a finite-dimensional multivariate Gaussian distribution, defined only at the locations of interest. The statistics of the approximation errors are directly estimated empirically, based on a small number of simulations of both the accurate and coarse models, and structural constraints are not typically placed on the form of the covariance matrix (Kaipio & Kolehmainen, 2013). While differences between the fine model and coarse model in the BAE approach are generally considered “approximation” errors between two different models, rather than “discrepancies” between reality and a model as in the KOH approach, these approximation errors typically include significant correlation structure, and the approach has been shown to work well in physical experiments (see, e.g., Lipponen et al., 2011, 2008; Nissinen et al., 2010). Additional systematic error can also be incorporated in the BAE approach in a straightforward manner; that is, it can directly incorporate correlation structure for both the error between the fine model and the data and in the error between the fine model and the coarse model.
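In code, the empirical construction of the BAE statistics amounts to running both models on a common set of parameter draws and taking the sample mean and covariance of the differences. A minimal sketch (names are ours; `fine_model` and `coarse_model` stand in for the expensive and cheap simulators):

```python
import numpy as np

def estimate_error_stats(samples, fine_model, coarse_model):
    """Empirical mean and covariance of the approximation error
    eps = fine_model(theta) - coarse_model(theta), computed over an
    ensemble of parameter samples (one sample per row of `samples`)."""
    errors = np.array([fine_model(t) - coarse_model(t) for t in samples])
    mean = errors.mean(axis=0)
    # Full sample covariance: no structural constraints are imposed.
    cov = np.cov(errors, rowvar=False)
    return mean, cov
```

Because the models are evaluated only at the measurement locations, the resulting Gaussian is finite dimensional, matching the description above.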

The BAE approach is particularly simple to implement and, given two physical models, requires less user input in terms of parameters and hyperparameters than the KOH approach. For a further discussion and comparison of the two methods see Fox et al. (2013). In this work we use (a variant of) the BAE approach; in contrast to past work in this area, however, we explicitly incorporate the approximation errors into a hierarchical framework. We discuss this next.

## 3 Hierarchical Framework

Here we outline our hierarchical Bayesian framework and where approximation errors enter. Implementation details are given in the following sections.

This factorization encodes *modeling* assumptions about the conditional independencies separating measurement and process variables (Berliner, 1996, 2003, 2012). For example, the measurement model (first factor) is assumed to be independent of the process parameters, while the process model (second factor) is assumed to be independent of the observation parameters. In terms of our current problem variables this becomes
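In generic notation (the symbols here are ours, standing in for the equations in the original), a Berliner-style factorization of the joint density of data $d$ and process variable $x$, given observation parameters $\theta_d$ and process parameters $\theta_x$, can be sketched as:

```latex
% Measurement model independent of process parameters;
% process model independent of observation parameters.
p(d, x \mid \theta_d, \theta_x)
  = \underbrace{p(d \mid x, \theta_d)}_{\text{measurement model}}\,
    \underbrace{p(x \mid \theta_x)}_{\text{process model}}
```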

### 3.1 Representation Using Functional Relationships

Defining the *total error* term as the sum of the process and measurement errors allows us to write the above two-level model as a single-level model:

### 3.2 Likelihood

The last step above follows when both error vectors are independent of the parameter; here δ is used to denote the Dirac delta distribution, which places all mass at 0. The assumption of independence of the model error vector and the parameter vector is discussed in detail in section 4. We have also explicitly denoted which probability distribution is being evaluated using subscripts.

These steps can be considered as a change of variables, from the process variable to the observable, via the Dirac delta method (Au & Tam, 1999; Khuri, 2004). As would be expected, this likelihood can also be obtained by marginalizing out the process variable in the factorization given in equation 3 and using the equivalent two-stage representation of the hierarchical model in equation 4, but the above derivation is slightly simpler.
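In generic notation (our symbols: a coarse forward model $\tilde{f}$, measurement error $e$, process/approximation error $\varepsilon$, and total error $\nu = e + \varepsilon$), the marginalization described above can be sketched as:

```latex
p(d \mid \theta)
  = \int p_e\!\left(d - x\right)\,
         \delta\!\left(x - \tilde{f}(\theta) - \varepsilon\right)
         p_\varepsilon(\varepsilon)\,\mathrm{d}\varepsilon\,\mathrm{d}x
  = p_\nu\!\left(d - \tilde{f}(\theta)\right)
```

The delta collapses the inner integral, leaving a convolution of the two error densities evaluated at the residual between the data and the coarse model output.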

### 3.3 Error Components in the Hierarchical Framework

Here we consider the two key sources of error, measurement error and process (approximation) error, in more detail.

#### 3.3.1 Measurement Error

In the two physically motivated cases considered in this paper we make the assumption that the measurement errors are also pairwise independent. However, this assumption is not required, and changing it simply changes the covariance matrix of the measurement errors (assuming they are Gaussian). A simple example using correlated measurement noise is provided in Appendix C.

#### 3.3.2 Process Error

Process errors, that is, approximation errors, are introduced by using a coarse model in place of a finer, or more accurate, simulation model. The fine model is represented by a function $f(\theta)$ and the coarse model by a function $\tilde{f}(\tilde{\theta})$, where $\theta$ and $\tilde{\theta}$ are the (vectors of) fine- and coarse-scale parameters of interest; in our case the fine-scale and coarse-scale parameters have the same dimension, despite corresponding to different discretization grids. That is, both models share the same parameters, and thus we drop the explicit distinction between $\theta$ and $\tilde{\theta}$ in what follows and simply refer to both as $\theta$ (but see Appendix A for a discussion of the relationship between fine-scale and coarse-scale parameter grids).

#### 3.3.3 Total Error

To construct a model of the total error, we (a) assume that the measurement error is Gaussian and (b) approximate the process model error as Gaussian. Both of these random vectors may in general exhibit significant correlations between their respective components, and this is accounted for in the present approach; however, the process error and measurement error vectors are assumed independent of each other. This makes combining the two errors straightforward (as described in the next subsection). Ultimately, we determine whether these, and the other approximations used thus far, are reasonable based on whether they work in practice—for example, whether they recover good estimates of the true parameters in test cases and whether any available error distributions “look normal” when plotted (or, if desired, pass formal tests of normality).
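Under assumptions (a) and (b), combining the two independent Gaussian error sources amounts to adding their means and covariances. A minimal sketch of the resulting corrected Gaussian log-likelihood (names are ours, not the authors' implementation):

```python
import numpy as np

def corrected_log_likelihood(coarse_output, data, noise_cov, err_mean, err_cov):
    """Gaussian log-likelihood with total error nu = e + eps:
    the mean shifts by the approximation-error mean, and the
    covariances of the independent error sources add."""
    total_cov = noise_cov + err_cov          # independence => covariances add
    resid = data - coarse_output - err_mean  # center on coarse output + error mean
    sign, logdet = np.linalg.slogdet(total_cov)
    return -0.5 * (resid @ np.linalg.solve(total_cov, resid)
                   + logdet + len(data) * np.log(2 * np.pi))
```

Here `err_mean` and `err_cov` would come from the empirical approximation-error statistics described in section 4.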

## 4 Computation of Error Models: Standard, Composite, and Posterior Informed

Both the probabilistic process model and process error model are typically intractable to simulate from for more than a limited number of realizations, as both involve the expensive fine-scale model. This motivates using approximations to these distributions, and results in approximate posterior distributions relative to the ideal target. The goal of these approximations is not to accurately estimate the model error as such but to approximately model the effect of marginalizing over it. This is for the purpose of reducing the bias/overconfidence in parameter estimates that would result from just using the simpler model directly; some loss of precision/statistical efficiency is expected. Here we give an overview of how the standard, composite, and posterior-informed approximation error models are computed and discuss relevant related literature. Explicit algorithms are given in the following section.

### 4.1 Premarginalization

Due to the computational issues discussed above, in the standard BAE approach the statistics of the approximation errors are precomputed empirically via directly drawing samples from the prior distribution, without the use of Markov chain Monte Carlo (MCMC). Similarly, here we compute the statistics of the approximation errors via direct sampling, though from a (naive) posterior distribution rather than the prior distribution, which itself can be (and, here, was) computed by separate MCMC sampling. MCMC sampling methods are discussed in section 5.3.

Our approach has the advantage of allowing a set budget of fine model runs to be specified, as well as minimal implementation difficulty. In contrast, some recent MCMC sampling schemes explicitly estimate and incorporate approximation errors *during* the MCMC sampling process. Similarly to our proposed method, Cui et al. (2011) and Cui, Fox, and O'Sullivan (2019) consider carrying out MCMC sampling on models of distinct levels of discretization while accounting for the approximation error; however, they use an adaptive delayed acceptance BAE approach to build the approximation error model during the MCMC sampling. In the methods developed by Cui et al. (2011) and Cui, Fox, and O'Sullivan (2019), the accurate model is typically run for each MCMC sample accepted based on the coarse model, which can make it more difficult to control the number of fine model runs used. While it is possible in principle to further modify the MCMC scheme to incorporate such constraints, our approach offers a simple and direct way of controlling this number.

Xu et al. (2017), Zhang et al. (2018), and Lødøen and Tjelmeland (2010) apply the KOH method to account for the approximation errors, incorporating these into an *adaptive multifidelity* MCMC sampler (see, e.g., Peherstorfer et al., 2018), the *differential evolution adaptive Metropolis* sampler (Vrugt et al., 2009; Laloy & Vrugt, 2012), and the Metropolis-Hastings algorithm (see, e.g., Chib & Greenberg, 1995), respectively. Again, these require more sophisticated understanding and control of the MCMC scheme used and involve infinite-dimensional stochastic processes following the approach of KOH. Here we provide a simple alternative based on the BAE approach to approximation error, involving finite-dimensional probability distributions only.

### 4.2 Standard Approximation Error

In the standard approach, samples drawn from the prior are used to estimate the statistics of the *true* (i.e., involving the fine-scale model) joint distribution

### 4.3 Enhanced, or Composite, Approximation Error

In practice, however, it is intractable to sample from the *true* conditional error distribution, and hence the samples are used to estimate the *true* marginal. On the other hand, in all subsequent calculations the joint distribution is approximated by the product of the marginals. This is equivalent to using the marginal error distribution as a plug-in empirical estimator of the conditional error distribution in the hierarchical model, prior to subsequent inference steps. Importantly, this does not mean that the individual errors in the error vector are independent of each other, but rather that the error vector, as a vector random variable, is taken to be independent of the parameter vector. The estimated errors almost always exhibit significant correlations between components, and these are accounted for here.

As emphasized above, the goal is not to get the error exact, but to account for it in a somewhat “conservative” manner. While in the BAE literature this is referred to as the enhanced error model, the replacement of an intractable conditional distribution in a product of distributions by a more accessible marginal distribution is also similar in philosophy to that used in, for example, the composite likelihood literature (Varin, 2008; Varin et al., 2011). Hence, we will prefer to refer to it as the *composite* error model in the remainder of the text.

Finally, we note that *after* the true marginal process model error has been empirically estimated, and the plug-in replacement has been made for the conditional distribution, the full process model error vector is assumed to be (formally) conditionally independent of the full parameter vector in any subsequent manipulations of the probability distributions.

### 4.4 Posterior-Informed Composite Approximation Error

In our proposed approach, we instead use a *posterior plug-in* estimate of the model approximation error. In particular, we make the plug-in estimate

Thus, we are simply using a different plug-in estimate of the model error. Since this estimate is used with the coarse model, we can revert to the broad prior in all subsequent calculations without model run failures.

Again, because our goal is not to model the error exactly, but rather to model the effect of marginalizing over it, we are willing to tolerate more potential inaccuracies at this stage. The present step of using posterior sampling for the approximation error is “riskier” than that in the previous section, however, in the sense that it involves a formal “double use of data” and tends to *narrow* rather than widen the error distribution, when compared to the distribution that results from using the prior. A geometric interpretation of this posterior model approximation step, and its potential dangers, is given in Appendix B.

Despite the above warnings, we believe that the use of posterior approximation errors, as described in the present work, is often a practical solution for complex models. It also has the benefit of providing more “relevant” estimates of the model error when the posterior based on the coarse model is not too far from the true posterior. One way to check this assumption would be to recompute the model error distribution under the final posterior and compare it to the error distribution computed under the coarse model posterior; checking for similarity of these distributions can be thought of as a form of posterior predictive check (see, e.g., Gelman et al., 2013, for a good general discussion of posterior predictive checks). This check does, however, require recomputing realizations from the fine-scale model and so is not always practical.

## 5 Statistical Algorithms

By taking a Gaussian approximation of the process error, we can characterize its distribution with the mean and covariance only. As discussed above, these cannot be computed analytically in general and thus must be estimated empirically via samples. In this section we give algorithmic details for both the standard *composite error model* approach and our proposed *posterior-informed composite error model* approach. Pseudocode is provided for both of the methods. We also outline the MCMC method used for sampling the resulting target posterior.

### 5.1 The Standard Composite Error Model Approach

This new distribution is then used to update the likelihood, which consequently updates the posterior density.

Algorithm 1 gives pseudocode for the standard composite error model approach for constructing the distribution of the total errors and for carrying out the inversion.

### 5.2 The Proposed Posterior-Informed Composite Error Model Approach

In the approach proposed here, we avoid sampling from the prior density to generate the ensemble of process model errors, in order to avoid model failures and extreme run times. Instead, we initially construct a *naive posterior* density (done here using MCMC) with the likelihood function induced by the measurement noise term only, and using the coarse model. This results in samples from the naive posterior, which are then passed through the two models to construct the process model errors. Once these samples have been generated, the method is essentially the same as that of the standard composite error model approach.
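The overall procedure can be sketched end to end. The sketch below uses a minimal random-walk Metropolis sampler as a stand-in for any MCMC package (the approach is sampler-agnostic) and toy placeholder models in which the "fine" model has a known systematic offset; it illustrates the steps of the posterior-informed approach, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis(log_post, theta0, n_steps, step=0.5):
    """Minimal random-walk Metropolis sampler (stand-in for any MCMC package)."""
    theta = np.atleast_1d(theta0)
    lp = log_post(theta)
    samples = []
    for _ in range(n_steps):
        prop = theta + step * rng.standard_normal(theta.shape)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        samples.append(theta)
    return np.array(samples)

# Toy placeholder models: the "fine" model has a systematic offset of 0.3.
coarse_model = lambda th: th
fine_model = lambda th: th + 0.3

data = np.array([1.0])
sigma = 0.1  # measurement noise standard deviation

# Step 1: naive posterior, using the coarse model and measurement noise only.
naive_lp = lambda th: -0.5 * np.sum((data - coarse_model(th)) ** 2) / sigma**2
naive = metropolis(naive_lp, np.zeros(1), 2000)

# Step 2: approximation-error statistics over a naive-posterior subsample.
subsample = naive[rng.choice(len(naive), size=200)]
errors = np.array([fine_model(t) - coarse_model(t) for t in subsample])
err_mean, err_var = errors.mean(axis=0), errors.var(axis=0)

# Step 3: corrected posterior, coarse model plus the Gaussian error model.
corrected_lp = lambda th: -0.5 * np.sum(
    (data - coarse_model(th) - err_mean) ** 2) / (sigma**2 + err_var.sum())
corrected = metropolis(corrected_lp, np.zeros(1), 2000)
```

In this toy setting the naive posterior concentrates on a biased value, while the corrected posterior shifts by the estimated error mean, mimicking the bias correction described above.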

Pseudocode for the proposed posterior-informed composite error model approach is given in Algorithm 2.

### 5.3 MCMC Sampling

In the present work, MCMC sampling is carried out using the Python package *emcee* (Foreman-Mackey et al., 2013). This package implements an affine invariant ensemble sampler (Goodman & Weare, 2010), with the benefit of being easy to use with arbitrary user-defined models. It also allows for easy communication with the PyTOUGH Python interface (Croucher, 2011) to TOUGH2 (Pruess et al., 1999) and AUTOUGH2 (Yeh et al., 2012) (The University of Auckland's version of TOUGH2) for carrying out the forward simulations.

For large-dimensional problems the affine invariant ensemble sampler may be inadequate (Huijser et al., 2015), in which case, alternative out-of-the-box samplers like those available in Stan (Carpenter et al., 2017), or PyMC (Patil et al., 2010) could be used. However, as alluded to earlier, the approach outlined here is essentially independent of the choice of particular MCMC sampler, providing flexibility in the choice of MCMC sampling scheme used while also being compatible with nonsampling, optimization-based methods.

## 6 Computational Studies

We consider multiphase nonisothermal flow in a geothermal reservoir, including both two-dimensional and three-dimensional reservoir case studies.

### 6.1 Governing Equations for Geothermal Simulations

The forward model, or *parameter-to-observable map*, can be understood by examining the key terms in equations 25 and 26, following Cui et al. (2011). A more in-depth discussion is given by Pruess et al. (1999). First, the amounts of mass and energy per unit control volume are given by

### 6.2 Model Setup and Simulation

We consider two scenarios as case studies—the first is based on a synthetic two-dimensional slice model, while the second is based on the Kerinci geothermal system, Sumatra, Indonesia. Each case study involves both a fine model and a coarse model, and thus, in total, we have four computational geothermal models in this work.

In all cases we solve the forward problem using the computer package AUTOUGH2 (Yeh et al., 2012), The University of Auckland's version of the TOUGH2 simulator (Pruess et al., 1999), with the pure water equation of state model (EOS1). We only consider steady state conditions, though, as is standard, we calculate steady states via time marching to assist convergence to proper model solutions.

The parameters of interest in both case studies are rock permeabilities, which are associated with a given rock type. There has been some work on allowing a distinct rock type for each cell in the computational model (Bjarkason et al., 2018, 2019; Cui et al., 2011; Cui, Fox, & O'Sullivan, 2019). However, the standard approach in geothermal modeling and inversion (Fullagar et al., 2007; O'Sullivan & O'Sullivan, 2016; Popineau et al., 2018; Witter & Melosh, 2018), and the approach taken here, is to base the simulation model on a conceptual model of the geological structure. The simulation hence respects the lithologic boundaries of these geological models. Mathematically, this is equivalent to *regularization by discretization*, see, for example, Kaipio and Somersalo (2005) or Aster et al. (2018), and is a way of incorporating important prior information. The present approach can allow for arbitrary assumptions on the rock type structure, though at the cost of higher dimensionality and/or increased ill-posedness. We aim to investigate the effects of including uncertainty in geological structure in future studies.

#### 6.2.1 Case Study I: Slice Model

For this case study we consider a two-dimensional slice model, shown in Figure 1, based on that considered in Bjarkason et al. (2016) and Maclaren et al. (2016).

The model geometry is a rectangular slice with physical dimensions of 1,600 m deep and 2,000 m wide. For our test problem we restricted the unknowns to a set of 12 parameters, two each for six rock-type regions, where these regions are assumed known in the present work. The location and intensity of the source are also assumed known. All six rock types are assumed to have the same porosity (10%), rock grain density (2,500 kg/m³), thermal conductivity (2.5 W/(m K)), and specific heat (1.0 kJ/(kg K)). The top boundary condition consists of constant pressure of 1 atm and constant temperature of 15 °C. The bottom boundary condition consists of a constant heat flux of 80 mW/m², except at the bottom-left corner region (see Figure 1) where 7.5 10 kg/(s m²) of a 1,200-kJ/kg enthalpy fluid is used as a deep source input. The side boundaries are closed.

The (noisy) measurements consist of temperatures taken at 15 depths down each of 7 vertical wells; this gives a total of 105 measurement points; see Figure A1. The synthetic data are corrupted by additive independent identically distributed mean zero Gaussian noise, which has a standard deviation of 5 °C.
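Synthetic data of this kind can be generated in one line once the noiseless forward solution is available. A sketch (the constant `true_temperatures` array is a hypothetical stand-in for the fine-model output at the 105 measurement points):

```python
import numpy as np

rng = np.random.default_rng(42)
n_wells, n_depths = 7, 15
# Placeholder forward-model output at the measurement points, in deg C.
true_temperatures = np.full(n_wells * n_depths, 150.0)
noise_sd = 5.0  # measurement noise standard deviation, deg C

# Additive i.i.d. mean-zero Gaussian noise, as described above.
noisy_data = true_temperatures + noise_sd * rng.standard_normal(n_wells * n_depths)
```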

We used two different computational discretizations, described in section 6.3.

#### 6.2.2 Case Study II: Kerinci Model

For this case study we consider a three-dimensional model of the Kerinci geothermal system, Sumatra, Indonesia, shown in Figure 2. This is based on a model developed by Prastika et al. (2016). We briefly recap the key model features here; for full details see Prastika et al. (2016).

The model geometry has physical dimensions of 16 km by 14 km (horizontal dimensions) by 5 km (depth). Our problem has a set of 30 parameters, 3 each for 10 rock-type regions, where these regions are assumed known in the present work. One of these “rock types” corresponds to an atmospheric layer, so we have 27 key parameters of interest to estimate. All nine nonatmospheric rock types are assumed to have the same porosity (10%), except for the rock labeled C0001 (representing pumice), which has a slightly higher porosity (12%). The rest of the properties of the nonatmospheric rock types were uniform, assumed to have the same rock grain density (2,500 kg/m³), thermal conductivity (2.5 W/(m K)), and specific heat (1.0 kJ/(kg K)). The top boundary condition consists of constant pressure of 1 bar and constant temperature of 25 °C. Most of the bottom boundary consists of a constant heat flux of 80 mW/m², except for a small number of blocks specified as the locations of deep source input (see Prastika et al., 2016). The total flow rate of the deep source input is 100 kg/s, split into 70 kg/s of fluid with an enthalpy of 1,400 kJ/kg, and 30 kg/s of fluid with an enthalpy of 1,100 kJ/kg.

Measurements consisted of a total of 17 temperature measurements taken down three wells. We assume that the data are corrupted by additive independent identically distributed mean zero Gaussian noise with a standard deviation of 10 °C.

We again used two different computational discretizations, described in section 6.3.

### 6.3 Approximation Error Computations

For each case study, we calculated the statistics of the approximation errors by using the AUTOUGH2 simulator. The same process was used in each case, though slightly different numbers of simulations were used for each case study. We outline the general process below while indicating any differences between case studies.

#### 6.3.1 Calculation Steps

In each case study, to calculate the statistics of the approximation errors, we simulated both the fine model and the coarse model 1,000 times each using AUTOUGH2. These simulations were taken over the naive posterior, which was first generated by running MCMC using the coarse model and without accounting for the approximation errors.

For the slice model scenario, the naive posterior was constructed from 150,000 samples generated by MCMC, while for the Kerinci scenario we generated 90,000 samples. The statistics of the approximation errors were then calculated, as described above, by running the coarse and fine models on 1,000 samples randomly selected from the full set of 150,000 (slice model) or 90,000 (Kerinci) naive posterior samples.

For the slice model scenario the fine model geometry consisted of a grid of 81 × 100 = 8,100 blocks (including one layer of atmospheric blocks), and the coarse model consisted of a grid of 17 × 20 = 340 blocks (again including one layer of atmospheric blocks). These model grids are shown in Appendix A.

For the Kerinci model scenario the fine model geometry consisted of 5,396 blocks (including one layer of atmospheric blocks) and the coarse model consisted of 908 blocks (again including one layer of atmospheric blocks). These model grids are again shown in Appendix A.

In each case study we ensured consistency of measurement locations using functionality of PyTOUGH described in O'Sullivan et al. (2013), which allows the same observation wells to be defined independently of grid resolution.

### 6.4 MCMC Computations

For the slice model scenario (and both with and without incorporation of the approximation errors) 150,000 samples were computed (an ensemble of 300 *walkers* taking 500 samples each) after discarding an initial 30,000 *burn-in* samples.

For the Kerinci scenario (both with and without incorporation of the approximation errors) 90,000 samples were computed (6 ensembles of 300 *walkers* taking 50 samples each) after discarding a total of 30,000 *burn-in* samples (5,000 for each ensemble).
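The runs above use an ensemble-of-walkers sampler; since our approach only modifies the target posterior, any standard sampler can be used in its place. As a minimal illustration of this "out-of-the-box" property, a basic random-walk Metropolis sampler (our own sketch, not the sampler used in the paper) needs nothing beyond a log-posterior function:

```python
import numpy as np

def random_walk_metropolis(log_post, x0, n_samples, step, rng):
    """Minimal random-walk Metropolis sampler (illustrative only).

    log_post: callable returning the (unnormalized) log posterior density;
    this is the only place the BAE-corrected posterior would enter."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    logp = log_post(x)
    samples = np.empty((n_samples, x.size))
    for i in range(n_samples):
        prop = x + step * rng.standard_normal(x.size)
        logp_prop = log_post(prop)
        # Metropolis accept/reject step
        if np.log(rng.random()) < logp_prop - logp:
            x, logp = prop, logp_prop
        samples[i] = x
    return samples

# Example: sample a standard normal "posterior"
rng = np.random.default_rng(42)
chain = random_walk_metropolis(lambda x: -0.5 * float(x @ x),
                               np.zeros(1), 20000, 2.4, rng)
```

Swapping the toy log posterior for the naive or approximation-error-informed coarse-model posterior changes nothing else in the sampling code, which is the sense in which the corrections are "out of the box."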

All computations were carried out on a standard desktop computer with an AMD Ryzen 5 1600 3.2-GHz 6-Core Processor.

### 6.5 Computational Requirements of Forward Model and MCMC Simulation

In the slice model scenario, the fine model took approximately 1–5 min per simulation, while the coarse model took less than half a second per simulation, typically about 0.45 s. Thus, generating 150,000 samples using naive MCMC to construct the posterior distribution using the fine model would take around 100–500 days, whereas using the same number of samples to construct the approximation error informed posterior using the coarse model took just less than 20 hr. Only taking into account these MCMC runs, in the worst case this represents a speedup of at least a factor of 100.

In the Kerinci scenario, the fine model took approximately 30 s per simulation for well-behaved cases, but potentially several hours for less well behaved models. The coarse model typically took about 1–10 s per simulation but could take several minutes for less well behaved cases. The run times for both of these cases were much more variable for this model than for the slice model. We generated 90,000 samples by running six chains in parallel (see below for more detail) and then combining these. Generating 90,000 samples to construct the posterior distribution using the fine model and naive MCMC would take at least a year, and possibly up to a decade, whereas using the same number of samples to construct the approximation error informed posterior using the coarse model and naive MCMC (again run in six parallel batches) took about 12 days.

More sophisticated parallelization (Laloy & Vrugt, 2012; Vrugt et al., 2009), or use of gradient information (Carpenter et al., 2017; Patil et al., 2010), in the MCMC sampling algorithms could of course considerably change these timing estimates for full MCMC. Here we restrict attention to a particularly simple black-box MCMC sampler that can be easily coupled to AUTOUGH2 simulations. In general, however, we would still expect significant practical speedups in a range of realistic scenarios, as the BAE approach is suited to problems in which approximate premarginalization can be carried out with many fewer samples than are required for full MCMC sampling.

In addition to the above rough timing estimates, the approximation error calculations further require both a naive posterior and the model approximation error statistics to be calculated. In each case, approximately the same amount of time was required to run full MCMC for the naive case and for the approximation error informed case: the key cost for all MCMC calculations is running the coarse model; only the statistics of the particular likelihood model differ. Thus, for the slice model, approximately 20 hr was required to run full MCMC for the naive case and another 20 hr for the approximation error informed case. For the Kerinci model, between 2 and 6 days would generally be expected for naive MCMC; here we found it took about 2.5 days when sampling from the prior and running the naive model.

In contrast to the MCMC cases, simulations of both the coarse and the accurate model are required to calculate the approximation error statistics. For the slice model scenario we generated 1,000 samples by running 200 runs of each model in parallel on five nodes; this also took just under 20 hr. Thus, the total time for inversions using (a naive version of) the approximation error approach is approximately 20 × 3 = 60 hr. The worst-case effective speedup factor compared to naive sampling is thus at least 30 in the present work, but typically more like 50–150.
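The slice model bookkeeping can be reproduced with a few lines of arithmetic (the per-simulation times are the approximate figures quoted above; this is accounting, not a benchmark):

```python
# Slice model: rough speedup bookkeeping.
n_samples = 150_000
fine_minutes_per_sim = (1.0, 5.0)   # fine model: ~1-5 min per simulation
bae_total_hours = 60.0              # naive MCMC + corrected MCMC + error stats: 20 x 3

fine_total_hours = [m * n_samples / 60.0 for m in fine_minutes_per_sim]
speedups = [h / bae_total_hours for h in fine_total_hours]
# Worst case: 150,000 min = 2,500 hr of fine-model MCMC, a speedup of roughly 42
# even after paying for all three 20-hr coarse-model stages.
```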

For the Kerinci scenario we again used 200 runs of each model in parallel on five nodes. This took approximately 5.5 days. Thus, the total time for inversions using the approximation error approach was around 2.5 + 5.5 + 6 = 14 days (2 weeks). This again gives a speedup (compared to, e.g., 1–10 years) of at least about 30.

Natural ways to further increase the speedup of the approximation error calculations include, for example, only running an approximate, optimization-based sampler to generate the initial naive posterior (from which only 1,000 samples will be used). In our case, however, we simply ran full MCMC separately for both the naive and the approximation error informed cases. This enabled us to give a relatively fair comparison of the results from these two models. Furthermore, the initially calculated naive posterior, or at least its second-order statistics, could be used either to initialize the second (main) run of MCMC for our algorithm, or as a proposal distribution.

### 6.6 Availability

Our code was written in Python 2.7 using open source Python packages. It is available at GitHub (https://github.com/omaclaren/hierarchical-bae-manuscript).

An archived version of this code is available at Zenodo (http://doi.org/10.5281/zenodo.3509966).

Access to the AUTOUGH2/TOUGH2 (Pruess et al., 1999; Yeh et al., 2012) simulator is also required; we plan to adapt our code to use the new open-source Waiwera simulator (Croucher et al., 2018) when it is officially released.

The key functionality is implemented in a small library of object-oriented classes implementing the various components of the hierarchical framework.

## 7 Results and Discussion

Here we compare a series of inversion results for both the slice model and the Kerinci model scenarios. For each scenario, we compare the results obtained using a coarse model without accounting for approximation errors with those obtained when the approximation errors are accounted for. We consider both data space (posterior predictive) and parameter space (parameter posterior) distributions.

Particular emphasis is placed on (a) the feasibility of the posterior uncertainty estimates in parameter space, that is, whether or not the posterior uncertainty is consistent with (i.e., supports) the true permeability values, and (b) the role of predictive checks with and without incorporation of the approximation errors.

### 7.1 Slice Model Scenario

Here we consider results from the slice model scenario.

#### 7.1.1 Posterior Predictive Checks

In Figure 3 we show posterior predictive checks constructed by running the model on a subset of posterior samples obtained from MCMC. Realizations of the process model without measurement error are plotted in blue, while the data obtained from running the fine model and adding measurement error are shown in black. Figure 3a shows the posterior predictive check under the coarse model while neglecting the approximation errors.

As can be seen in the figure, the coarse model fits the data well and the uncertainties are small. Thus, this check does not flag any potential issue with naively using the coarse model. On the other hand, (b) and (c) show the predictive checks resulting from inference under the approximation error corrected model. In particular, (b) shows the results when only the covariance of the approximation errors is accounted for, while (c) shows the results when both the approximation error covariance and offset (bias) terms are included. Comparison of (b) and (c) shows that both error correlations and the bias term are important for obtaining a properly fitting model. More importantly, the difference in variation between (a) and (c) indicates that we are potentially *underestimating* the uncertainties involved in naively using the coarse model for inversion. Intuitively, the low variance of the naive posterior is counterbalanced by the introduction of additional bias into the parameter estimates. This is illustrated in the next subsection.
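The three variants compared above differ only in the likelihood used inside MCMC. A minimal sketch under the usual Gaussian BAE assumptions (the function names below are ours, not the paper's notation): the naive likelihood ignores the approximation errors, the covariance-only correction inflates the noise covariance, and the full correction also subtracts the mean (offset) of the errors.

```python
import numpy as np

def gaussian_loglike(residual, cov):
    """Gaussian log-density of the residual, up to an additive constant."""
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (logdet + residual @ np.linalg.solve(cov, residual))

def naive_loglike(d, coarse_pred, gamma_noise):
    # (a) measurement noise only, approximation errors ignored
    return gaussian_loglike(d - coarse_pred, gamma_noise)

def bae_loglike(d, coarse_pred, gamma_noise, mu_eps, gamma_eps,
                include_offset=True):
    # (b)/(c) enhanced error model: inflate the covariance by the
    # approximation error covariance and (optionally) subtract the offset.
    residual = d - coarse_pred - (mu_eps if include_offset else 0.0)
    return gaussian_loglike(residual, gamma_noise + gamma_eps)
```

Since only the log-likelihood changes, the same MCMC sampler can target any of the three posteriors unmodified.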

An implication of these results is that, in general, *posterior predictive checks against the original data do not appear to indicate issues that arise due to inversion under a reduced-order model*. This is perhaps to be expected due to the ill-posed nature of inverse problems; that is, pure within-sample data fit checks are not sufficient to determine whether a model is appropriate. One potential fix is to carry out checks either on held-out data or, as in our case, against a more expensive/accurate model, which effectively plays the role of held-out data.

#### 7.1.2 Posterior Parameter Distributions

Here we consider the (marginal) parameter space posterior distributions, both for the naive and the approximation error informed models. Figure 4 shows the marginal posteriors of the permeability for each rock type and each direction and both with and without incorporation of the approximation errors.

The first set of plots, in Figure 4, show the parameters for which fairly consistent results were reached by both the naive and approximation error informed models. The second and third sets of plots, shown in Figure 5 as two sets of plots labeled by (a) and (b) for easier visual comparison, show cases where the results tend to conflict between the naive and the approximation error informed models.

As can be seen in Figure 5, naive inversion under the coarse model often results in essentially infeasible parameter estimates, that is, posteriors for which the truth is assigned only a low posterior probability density. On the other hand, the approximation error corrected case always assigns a high posterior probability density to the true parameters (though in *some* cases this is slightly lower than the density assigned under the naive case). In reality, of course, neither model will be correct, but it is hoped that the fine-scale model is a better reflection of the truth.

Some of the parameters appear to be effectively nonidentifiable, as indicated by the lack of updating when comparing the prior to posterior distributions (see Evans, 2015, for a systematic review and discussion of measuring statistical evidence in a Bayesian setting). This lack of identifiability can also be quantified using, for example, the Kullback-Leibler divergence; however, we prefer to present comparisons graphically, following the general Bayesian data analysis philosophy of Gelman et al. (2013). In particular, the horizontal permeabilities of the cap rock and of the outflow region appear to be largely uninformed by the data. Physically, this could be explained by the fact that there is very little horizontal fluid flow in the cap rock and essentially all fluid flow in the outflow region is in the vertical direction. On the other hand, the remaining parameters appear to be reasonably well identifiable, and several appear to be strongly identified. Under the naive model, however, inversion for the strongly identifiable parameters gives posteriors that *appear* very well informed but are in fact providing effectively infeasible estimates. This illustrates another trade-off: parameters that are strongly informed by the data under one model will tend to be more strongly biased toward different values when estimated under a different model.

### 7.2 Kerinci Model Scenario

Here we consider results from the Kerinci model scenario.

#### 7.2.1 Posterior Predictive Checks

In Figure 6 we show posterior predictive checks constructed by running the model on a subset of posterior samples obtained from MCMC. Realizations of the process model without measurement error are plotted in blue, while the data obtained from running the fine model and adding measurement error are shown in black. Figure 6a shows the posterior predictive check under the coarse model without incorporation of the approximation errors, while Figure 6b shows the results incorporating the approximation errors.

As can be seen, in this more realistic model, and under more extreme model simplification (the discretization is significantly reduced and simplified in the coarse model) the approximation errors can be quite large. The difference in variation between Figures 6a and 6b certainly indicates that we are likely *underestimating* the uncertainties involved in naively using the coarse model for inversion. Although the coarse model predictive check provides a tighter fit around the measured data, it also assigns much less probability density to at least one data point, so in this sense provides a worse fit to the data and in this case flags potential underfitting of the coarse model.

#### 7.2.2 Posterior Parameter Distributions

Here we consider the (marginal) parameter space posterior distributions, both for the naive and the approximation error informed models; see Figures 7 and 8. For brevity we include a representative selection here; the remaining distributions as well as full corner plots (Foreman-Mackey, 2016) are given in the supporting information. The same basic patterns observed in the plots shown here can also be seen in the plots in the supporting information.

The first set of plots, in Figure 7, show the parameters for which fairly consistent results were reached by both the naive and approximation error informed models. The second set of plots, shown in Figure 8, shows cases where the results tend to conflict between the naive and the approximation error informed case. Here the true parameters are unknown and hence not shown.

### 7.3 Additional Comments

As we have noted above, standard MCMC sampling is much more computationally feasible for these geothermal inverse problems when using coarser models as opposed to finer, more accurate models. Importantly, however, we see that just naively using a coarse model without accounting for approximation errors tends to give overconfident and biased posteriors, for which the known true parameters can lie outside of the bulk of the support. On the other hand, taking into account the approximation errors leads to known true parameters lying inside the bulk of the support in all cases considered here. Both methods require effectively the same amount of computation time, though the BAE approach requires some additional initial computation to construct the model error statistics. This additional computational effort is the price paid to avoid misleading estimates and is still significantly less than attempting MCMC using the fine model.

In this paper we have only considered the use of relatively naive MCMC sampling to estimate the posterior density for the permeabilities, based on an approximation error informed coarse model. More sophisticated MCMC algorithms, for example, those utilizing parallelization (Laloy & Vrugt, 2012; Vrugt et al., 2009) and/or derivative information (Carpenter et al., 2017; Patil et al., 2010) would be expected to speed up the sampling significantly. In some settings, however, these more sophisticated forms of MCMC may still be computationally infeasible even using only the coarse model (with or without approximation errors included). In this case, the posterior approximation errors can still be constructed without MCMC, as long as some alternative method is available for drawing the (smaller) set of required samples from the naive posterior. For example, here we only required 1,000 samples from the naive posterior, compared to the 150,000 or 90,000 used for full MCMC runs. This would then enable the use of a coarse model which accounts for approximation errors alongside alternative sampling and/or optimization-based approaches.

## 8 Conclusions

We have demonstrated how to carry out simple yet computationally feasible parameter estimation and uncertainty quantification for geothermal simulation models by using a coarser, or cheaper, model in place of a finer, or more expensive, model. Our approach was to construct an approximation to the posterior Bayesian model approximation error and incorporate this into a hierarchical Bayesian framework. The hierarchical Bayesian perspective provides a flexible and intuitive setting for specifying assumptions on different model components and their combinations. In this view, approximations and modeling assumptions are directly incorporated into the framework by replacing joint distributions by factorizations in terms of simpler conditional and/or marginal distributions.

Our approach requires two simple initial computational steps in order to correct for the bias and/or overconfidence that would normally be introduced by directly using the coarse model in place of the finer model. These two steps then enable standard, out-of-the-box MCMC to be used to sample the parameter posterior using the coarse model. We demonstrated that our approach can achieve significant computational speedups on both synthetic and real-world geothermal test problems.

Our approach consists of three relatively simple steps overall and should be more accessible to general practitioners than having to manually implement more complex sampling schemes. Furthermore, the methods developed here should be generally applicable to related inverse problems such as, for example, those appearing in petroleum reservoir engineering and groundwater management.

## Acknowledgments

The authors appreciate the contribution of the NZ Ministry of Business, Innovation and Employment for funding parts of this work through the Grant C05X1306 (Geothermal Supermodels). The authors would also like to thank Jari Kaipio for helpful discussions about Bayesian approximation error methods, Joris Popineau for visualizations of the Kerinci model, Ryan Tonkin for useful discussions on geothermal modeling, and the three reviewers for feedback that significantly improved this manuscript. Our code is available from GitHub (https://github.com/omaclaren/hierarchical-bae-manuscript) and is archived at Zenodo (http://doi.org/10.5281/zenodo.3509966).

## Appendix A: Mapping Between Fine and Coarse Grids

Here the notation for the fine model involves a slight abuse: it still in fact represents the fine model evaluated on the fine parameter grid, but with the parameters fixed to be homogeneous within a given rock type, matching the values in the corresponding rock types on the coarser grid. Thus, the parameter vectors have the same effective dimension (and values) as on the coarse grid and are in one-to-one correspondence. This is made clearer by comparing Figure A1 below to Figure 1 introduced earlier: Each mesh in Figure A1 represents a different discretization of the *same* underlying parameter grid given in Figure 1. This assumption means we can compute the approximation error by sampling the coarse parameters directly rather than the (larger-dimensional) fine parameters. Implicitly, however, this neglects some of the approximation error that would be induced by sampling over all fine parameter sets compatible with a given coarse parameter set. This assumption can be checked, or removed, to the extent that computational resources allow computing the error over the fine grid (Kaipio & Kolehmainen, 2013). Either way, the *coarse* grid parameters are the ultimate targets of inference, and by using the more conservative “enhanced” (or “composite”) error model based on the marginal error distribution, we can hope to account for some of this additional uncertainty indirectly.
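The correspondence described here amounts to a simple lookup: each fine-grid block inherits the parameter value of its rock type, so the effective parameter vector has dimension equal to the number of rock types. A hypothetical sketch (names and shapes are illustrative, not taken from the paper's code):

```python
import numpy as np

def expand_to_fine_grid(rock_type_values, fine_block_rock_types):
    """Map per-rock-type parameters onto every block of the fine grid.

    rock_type_values: array of length n_rock_types (the coarse-level parameters)
    fine_block_rock_types: integer rock-type index for each fine-grid block
    """
    return rock_type_values[fine_block_rock_types]

# Hypothetical example: 3 rock types, 8 fine-grid blocks
values = np.array([1e-15, 5e-14, 2e-13])        # e.g., permeabilities per rock type
block_types = np.array([0, 0, 1, 1, 1, 2, 2, 0])
fine_params = expand_to_fine_grid(values, block_types)
```

Because the mapping is a pure lookup, sampling the coarse (rock-type) parameters and sampling the homogeneous-within-rock-type fine parameters are interchangeable, which is the one-to-one correspondence used in the appendix.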

The fine and coarse Kerinci models were related in the same manner, with only the mesh discretization varying. A top view of the two meshes is shown in Figure A2.

## Appendix B: Geometric View of BAE

In Figure B1 below we give a geometric picture of both the standard prior-based and our posterior-based composite (enhanced) error model approach. In both cases we essentially aim to conservatively cover the deterministic functional relationship between the parameter and the model error, or the associated degenerate joint distribution, by a probability distribution based on marginal distributions. In the posterior case, however, we restrict attention to estimating the error by sampling over the support of the naive posterior. As can be seen in the figure, the accuracy of this procedure depends on, for example, how well the naive posterior approximates the true posterior. Alternatively, if the error is approximately independent of the parameter, hence giving a horizontal line for the error as a function of the parameter, then both the prior and posterior error distributions would give the same delta distribution for the error, regardless of how well the naive posterior approximates the true posterior. Thus, intuitively, the procedure would be expected to be most reasonable when (a) the naive posterior approximates the true posterior reasonably well and/or (b) the model error does not depend strongly on the parameter. This latter condition is already a condition for the usual enhanced/composite error model approach to provide a reasonable approximation, and so switching to the posterior composite error model is at least consistent with this assumption.

## Appendix C: An Illustrative Example

Here we consider a simple curve-fitting problem to provide some further intuition for our method and to provide an example of how the method works with correlated measurement errors. We take the accurate model between parameters and observations to be an *n*th-order polynomial evaluated at a set of measurement points, and take the coarse model to be the same polynomial truncated at order *m*, for some *m* < *n*.

In this case the coarse model is the accurate model composed with a diagonal orthogonal projection matrix that retains the first *m* + 1 polynomial coefficients and sets the rest to zero. This results in the standard BAE composite posterior (in this simple case it is in fact possible to use the standard BAE approach, as model failures are not an issue) coinciding with the posterior computed using our proposed approach.
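Under this truncated-polynomial setup the approximation error can be written down explicitly: it is exactly the contribution of the discarded higher-order terms. A small sketch of the quadratic/linear case (the measurement points and coefficients here are illustrative, not the values used in the paper):

```python
import numpy as np

t = np.linspace(-1.0, 1.0, 20)           # illustrative measurement points
F = np.vander(t, 3, increasing=True)     # accurate model: columns 1, t, t^2
C = F.copy()
C[:, 2] = 0.0                            # coarse model: quadratic column zeroed

x = np.array([0.5, -1.0, 2.0])           # illustrative polynomial coefficients
eps = F @ x - C @ x                      # model approximation error
# eps is exactly the truncated quadratic contribution, x[2] * t**2
```

Note that `eps` depends only on the truncated coefficients, which is why the projection structure makes the standard and posterior-informed BAE approaches coincide here.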

We compare (a) the naive approach, that is, ignoring the model approximation errors altogether; (b) the posterior densities computed using our proposed approach; and (c) the true posterior, calculated using the accurate forward model.

In line with the geothermal examples, we take the prior covariance matrix for the parameters to be diagonal, namely a scalar multiple of the identity matrix. This choice of prior, along with the fact that the simplified model is a truncation of the accurate model, in fact results in the posterior of our proposed method being identical to the true posterior for the parameters retained by the coarse model.

To illustrate the method with *correlated* measurement noise, we take the additive noise to have a (multilevel) block-correlated Gaussian form.

For this example the measurement points are equally spaced over the measurement interval. For ease of visualization we take the accurate model to be a quadratic, while the coarse model is linear. The correlated noise distribution is set by taking the noise covariance to be block diagonal with three diagonal blocks; this corresponds to a noise level of 30% of the maximum of the noiseless synthetic measurements; see Figure C1, which also shows the data for this example.

The resulting marginal and joint posteriors for the two parameters shared by both models, using each of the methods, are shown in Figure C2, while the posterior predictive plots are shown in Figure C3. It is clear that using the naive posterior (i.e., neglecting the approximation errors) can lead to an infeasible posterior, in the sense that the true parameter values have almost vanishing posterior probability. On the other hand, in this example, using the proposed posterior composite error model leads to a feasible posterior with a more representative MAP estimate.
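Because both models in this illustrative example are linear in the parameters and all distributions are Gaussian, the posteriors compared above can be computed in closed form rather than by MCMC. The sketch below uses illustrative values (the specific settings used in the paper are not reproduced here) and follows the posterior-informed recipe: a naive posterior first, then error statistics over naive posterior samples, then an offset-and-covariance-corrected coarse-model likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(-1.0, 1.0, 20)
F = np.vander(t, 3, increasing=True)     # accurate (quadratic) design matrix
C = F.copy(); C[:, 2] = 0.0              # coarse (linear) design matrix

x_true = np.array([0.5, -1.0, 2.0])      # illustrative true coefficients
gamma_e = 0.1 ** 2 * np.eye(len(t))      # illustrative iid noise covariance
d = F @ x_true + 0.1 * rng.standard_normal(len(t))

gamma_x = 2.0 ** 2 * np.eye(3)           # illustrative diagonal prior covariance
m0 = np.zeros(3)                         # illustrative prior mean

def gaussian_posterior(A, d, noise_cov, shift=0.0):
    """Closed-form Gaussian posterior for d = A x + shift + noise."""
    prec = A.T @ np.linalg.solve(noise_cov, A) + np.linalg.inv(gamma_x)
    cov = np.linalg.inv(prec)
    mean = cov @ (A.T @ np.linalg.solve(noise_cov, d - shift)
                  + np.linalg.solve(gamma_x, m0))
    return mean, cov

# (a) naive posterior (coarse model, errors ignored) and (c) true posterior:
mean_naive, cov_naive = gaussian_posterior(C, d, gamma_e)
mean_true, cov_true = gaussian_posterior(F, d, gamma_e)

# (b) posterior-informed BAE: error statistics over naive posterior samples,
# then the corrected coarse-model likelihood.
xs = rng.multivariate_normal(mean_naive, cov_naive, size=2000)
errs = xs @ (F - C).T                    # eps(x) = F x - C x for each sample
mu_eps, gamma_eps = errs.mean(axis=0), np.cov(errs, rowvar=False)
mean_bae, cov_bae = gaussian_posterior(C, d, gamma_e + gamma_eps, shift=mu_eps)
```

Inflating the likelihood covariance can only widen the posterior relative to the naive case, which is the mechanism by which the corrected posterior regains feasibility.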