Physics-Informed Machine Learning Method for Large-Scale Data Assimilation Problems
Abstract
We develop a physics-informed machine learning approach for large-scale data assimilation and parameter estimation and apply it for estimating transmissivity and hydraulic head in the two-dimensional steady-state subsurface flow model of the Hanford Site given synthetic measurements of said variables. In our approach, we extend the physics-informed conditional Karhunen-Loève expansion (PICKLE) method to modeling subsurface flow with unknown flux (Neumann) and varying head (time-dependent Dirichlet) boundary conditions. We demonstrate that the PICKLE method is comparable in accuracy with the standard maximum a posteriori (MAP) method but is significantly faster than MAP for large-scale problems. Both methods use a mesh to discretize the computational domain. In MAP, the parameters and states are discretized on the mesh; therefore, the size of the MAP parameter estimation problem directly depends on the mesh size. In PICKLE, the mesh is used to evaluate the residuals of the governing equation, while the parameters and states are approximated by truncated conditional Karhunen-Loève expansions with the number of parameters controlled by the smoothness of the parameter and state fields, and not by the mesh size. For the considered example, we demonstrate that the computational cost of PICKLE increases near-linearly (as N^1.15) with the number of grid nodes N, while that of MAP increases much faster (as N^3.28). We also show that once trained for one set of Dirichlet boundary conditions (i.e., one river stage), the PICKLE method provides accurate estimates of the hydraulic head for any value of the Dirichlet boundary conditions (i.e., for any river stage).
Key Points
- The modified physics-informed machine learning PICKLE method for large-scale data assimilation is proposed
- PICKLE is orders of magnitude faster than the traditional maximum a posteriori probability method for the considered high-resolution Hanford model
- Trained for one set of boundary conditions, the PICKLE method can model data for different values of the boundary conditions
Plain Language Summary
We present a novel physics-informed machine learning framework for parameter and state estimation in models of large-scale natural systems. Using the Hanford Site as an example, we demonstrate that the proposed framework is comparable in accuracy with standard parameter estimation methods but is significantly faster than these methods.
1 Introduction
Improving the predictive ability of numerical models has been the main goal of computational sciences. However, when applied to natural systems such as subsurface flow and transport, predictive modeling is complicated by the inherent uncertainty in the distribution of subsurface properties, including hydraulic conductivity, that enter the subsurface models as parameters. Uniquely estimating subsurface parameters from measurements of parameters and system states (e.g., hydraulic head) without numerical regularization is not possible because of the ill-posedness of the arising inverse problems. To further complicate the matter, the multiple length scales of heterogeneity and multiple time scales of flow and transport processes create enough ambiguity such that the same data can be described with different models, including deterministic and stochastic partial differential equation (PDE) models, non-local (integro-differential equation) models, and, most recently, machine learning and artificial intelligence-based models.
In selecting the right modeling approach, one can rely on Occam's razor, or the law of parsimony, the principle that the simplest explanation (model) is usually the right one. However, given the aforementioned uncertainty in model parameters, the ability of a model to be conditioned on spatially varying data is another critical criterion in selecting a computational model (Neuman & Tartakovsky, 2009). In theory, any model that involves observable parameters and states can be conditioned on the measurements of these variables if such measurements are available. However, the computational cost of conditioning models on data may vary significantly.
Conditioning models on direct measurements of space-varying parameters (e.g., transmissivity) is relatively straightforward and can be achieved using Kriging or Gaussian process regression (GPR; Neuman, 1993; Tipireddy et al., 2020). In Kriging, an estimate of the conductivity field is obtained as a linear combination of the measured values and basis functions given in terms of a covariance kernel (Matheron, 1963; Rasmussen, 2003). Then, the solution of the flow problem (conditioned on the measurements of transmissivity) can be found by solving the Darcy flow equation with the transmissivity field given by the Kriging estimate. The advantage of using Kriging for conditioning flow models on transmissivity measurements is that it provides an empirical Bayesian prediction in terms of the conditional mean (the most likely distribution of transmissivity given its measurements) and the conditional transmissivity covariance (a measure of uncertainty). The conditional mean and covariance of transmissivity can be used to obtain probabilistic estimates of the hydraulic head and fluxes conditioned on the transmissivity measurements. However, conditioning deterministic or stochastic flow predictions on the measurements of both the hydraulic head and transmissivity is more challenging because it requires solving an inverse problem, which typically involves computing forward solutions of the Darcy equation multiple times for different realizations of the transmissivity field.
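To make the conditioning step concrete, the following is a minimal NumPy sketch of Kriging/GPR conditioning on transmissivity measurements. The squared-exponential kernel and its hyperparameters are illustrative assumptions, not the kernel used in the Hanford study.

```python
import numpy as np

def se_kernel(X1, X2, var=1.0, length=1000.0):
    # Squared-exponential covariance kernel (an illustrative choice; the
    # kernel and hyperparameters of the actual study are not given here).
    d = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)
    return var * np.exp(-0.5 * (d / length) ** 2)

def krige(X_star, X_obs, y_obs, noise=1e-8):
    # Conditional (posterior) mean and covariance of y at points X_star
    # given observations y_obs at locations X_obs.
    K_oo = se_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    K_so = se_kernel(X_star, X_obs)
    K_ss = se_kernel(X_star, X_star)
    mean = K_so @ np.linalg.solve(K_oo, y_obs)          # conditional mean
    cov = K_ss - K_so @ np.linalg.solve(K_oo, K_so.T)   # conditional covariance
    return mean, cov
```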
The inverse problem of computing the deterministic conductivity and hydraulic head fields given sparse measurements of these fields can be solved via maximum a posteriori (MAP) estimation, a Bayesian point estimation approach that consists of computing the largest mode of the posterior distribution of conductivity conditioned on the observations (D. A. Barajas-Solano et al., 2014; Kitanidis, 1996). As we later demonstrate in this paper, the computational cost of the MAP method rapidly increases with the number of unknown parameters (approximately as the cube of the number of unknown parameters). Several methods have been proposed to address this curse of dimensionality by reducing the number of parameters in the representation of the space-varying conductivity field. The simplest approach involves a coarse representation of the conductivity field that disregards small-scale heterogeneity (e.g., variations of the hydraulic conductivity within geological layers). However, numerous studies (e.g., Shuai et al., 2019; A. M. Tartakovsky, 2010) have demonstrated that small-scale variations in conductivity might have a significant effect on large-scale flow and transport processes. The pilot point method (PPM; Certes & de Marsily, 1991) addresses this issue by modeling the small-scale heterogeneity as a function of a small number of parameters (pilot points) that are estimated through the inverse procedure. Some of the challenges in PPM include the dependence of parameter estimates on the number and locations of the pilot points, the functional approximation of small-scale heterogeneity, and the regularization schemes (Doherty et al., 2010). A singular value decomposition (SVD) analysis of the sensitivities of observations with respect to pilot points can be leveraged to reduce the effective dimension of the inverse problem (see, e.g., Tonkin & Doherty, 2005). Other approaches for reducing the effective dimension of the inverse problem include the principal component geostatistical approach (PCGA; Kitanidis & Lee, 2014; Lee & Kitanidis, 2014) and methods based on data assimilation over latent spaces constructed via machine learning approaches, such as variational autoencoder methods and conditional generative adversarial networks (Kadeethum et al., 2021; O'Malley et al., 2019), among others.
Bayesian methods provide another approach for solving inverse problems. Compared to deterministic methods such as MAP, Bayesian methods aim to estimate the probabilistic distributions of the parameters and states conditioned on measurements (D. Barajas-Solano & Tartakovsky, 2019; Herckenrath et al., 2011; Li & Tartakovsky, 2020; Yoon et al., 2013), and by construction quantify the uncertainty in the inverse problem solution.
Physics-informed machine learning methods have recently emerged as promising tools for estimating parameters in differential equation models, including subsurface flow models. Many deterministic and probabilistic machine learning methods have been proposed where physics is enforced through optimization constraints or physics-model-generated data sets (Karpatne et al., 2017). Here, we focus on the physics-informed conditional Karhunen-Loève expansion (PICKLE) method (A. Tartakovsky et al., 2020), where the state and parameter fields are represented using conditional Karhunen-Loève expansions (CKLEs) (Tipireddy et al., 2020). The parameters in these expansions are found by minimizing the sum of squared differences between the CKLE approximations and the measurements of the fields, plus the sum of squared residuals of the governing equation. A similar approach is used in the physics-informed neural network (PINN) method (He et al., 2020; A. M. Tartakovsky et al., 2020), where deep neural networks are used to approximate the parameter and state fields. The main difference between physics-informed ML methods such as PICKLE and PINN and standard methods for solving inverse problems is that the ML methods only require computing derivatives of the approximations with respect to space and time (to evaluate the residuals) and with respect to the parameters (as part of gradient-based minimization algorithms). There is no need to numerically solve the governing equation as in PPM, which is often the computational bottleneck of standard inverse methods. The key advantage of using CKLEs over neural networks for modeling the parameter field is that CKLEs enforce a spatial covariance structure on the modeled field and act as a geostatistics-based regularizer.
Like the PCGA method, PICKLE employs the eigendecomposition of the parameter covariance matrix. However, these methods differ in several key ways. In PCGA, the transmissivity field is modeled as a linear expansion with a polynomial basis plus a deviation component whose spatial distribution is penalized via a weighted ℓ2-norm, where the weight is the inverse of a covariance matrix that models the spatial variability of the deviation component; furthermore, the computational cost of PCGA is reduced by a matrix-free optimization algorithm that leverages the eigendecomposition of the covariance matrix. In PICKLE, on the other hand, the transmissivity field is fully modeled via a CKLE, and the computational cost is controlled by the size of this CKLE.
In this work, we use the PICKLE method to obtain deterministic estimates of the transmissivity and hydraulic head fields conditioned on measurements of said fields. We apply this method to modeling steady-state two-dimensional groundwater flow at the Hanford Site given synthetic measurements of the transmissivity and hydraulic head. The synthetic measurements are generated using the hydraulic conductivity measurements and boundary conditions obtained in the Hanford Site calibration study of Cole et al. (2001). The PICKLE method was introduced in A. Tartakovsky et al. (2020) for parameter estimation in PDE models for a given set of deterministically known, fixed boundary conditions. Here, we extend the PICKLE method to problems with uncertain flux boundary conditions and incrementally changing Dirichlet boundary conditions. Another significant contribution of this work is testing the PICKLE method on problems with high-dimensional realistic parameter (transmissivity) fields (we find that more than 1000 terms in the KL expansion are needed to accurately approximate the log-transmissivity field obtained from the Hanford Site calibration study). We compare the performance of the PICKLE and MAP methods and show that the two methods have comparable accuracy, while the computational cost of MAP increases significantly faster with the problem size than that of PICKLE.
2 Groundwater Flow Model and Maximum a Posteriori Formulation of the Inverse Problem
where ∂D is the boundary of D, and ∂D_N and ∂D_D are the portions of the boundary where the Neumann and Dirichlet boundary conditions are prescribed, respectively. In Equation 2, q(x) is the normal flux at the Neumann boundary ∂D_N, and n(x) is the unit vector normal to ∂D_N. In Equation 3, uD(x) is the prescribed hydraulic head on the Dirichlet boundary ∂D_D.
In this work, we assume that there are Nus and Nys observations of u and y = ln T, respectively, organized into the vectors us and ys. The locations of the u and y observations are organized into the arrays Xu and Xy, respectively. In groundwater models, the Dirichlet boundary ∂D_D usually models the boundaries formed by rivers, in which case uD(x) is equal to the water level in these rivers, which is easy to measure. Therefore, we treat uD(x) as a known function. The homogeneous Neumann boundary condition is imposed at the boundaries formed by impermeable (e.g., basalt) layers, and we also treat this boundary condition as known. Non-homogeneous Neumann boundary conditions are commonly used to model groundwater inflow/outflow from/to rivers and lakes, groundwater discharge to the sea, and surface recharge. The measurements or estimates of groundwater inflow/outflow fluxes are significantly less accurate than those of the Dirichlet boundary conditions uD(x) (Jazayeri & Werner, 2019). Therefore, in this work, we consider two cases: one where q(x) is known, and another where q(x) is unknown and is estimated along with y(x) and u(x).
with the stiffness matrix A and right-hand side b defined in Appendix A. In Equation 4, l(u, y, q) denotes the vector of discretized BVP residuals. The entries of the l vector correspond to the FV mass balance for each of the N cells. The set of cells can be split into three subsets: the cells adjacent to ∂D_N, the cells adjacent to ∂D_D, and the remaining interior cells. Let l_N denote the vector of l entries corresponding to the cells adjacent to ∂D_N. Only the mass balance for the cells in this subset explicitly includes the q contributions; therefore, q enters into l only through the l_N entries, and the remaining entries of l do not depend directly on q.
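The sketch below illustrates how such a residual vector could be evaluated; the TPFA-FV assembly routines `assemble_A` and `assemble_b` are hypothetical placeholders for the assembly defined in Appendix A.

```python
import numpy as np

def residual(u, y, q, assemble_A, assemble_b):
    # Discretized BVP residual l(u, y, q) = A u - b (Equation 4), where
    # the stiffness matrix and right-hand side are built from T = exp(y).
    A = assemble_A(np.exp(y))
    b = assemble_b(np.exp(y), q)  # q enters only through the rows of
                                  # cells adjacent to the Neumann boundary
    return A @ u - b
```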
3 PICKLE Method for Inverse Problems
3.1 Method Formulation
Here, the superscript c denotes that uc and yc are conditioned on the measurements of y. Methods for computing the conditional means and covariances of u and y entering these expansions are described in Section 3.2.
In this work, we set rtolu = rtoly = 10⁻⁶, unless mentioned otherwise.
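As an illustration, the number of retained KL terms could be selected from the eigenvalue spectrum as sketched below. We assume here that rtol denotes the ratio of the smallest retained eigenvalue to the largest, which is one plausible reading of the truncation criterion.

```python
import numpy as np

def truncate_kl(eigvals, rtol=1e-6):
    # Keep the leading KL terms whose eigenvalues exceed rtol times the
    # largest eigenvalue (an assumed interpretation of the criterion).
    eigvals = np.sort(eigvals)[::-1]
    return int(np.sum(eigvals / eigvals[0] > rtol))
```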
for the estimate of q, where the estimates of u and y are the solutions of the minimization problem (Equation 14). We note that l is linear in q, so that the solution of Equation 15 is straightforward. Because we use a CKLE for y conditioned on the y measurements, the y estimate satisfies the y observations by construction. On the other hand, the CKLE for u is not conditioned on the u measurements; therefore, we have included in Equation 14 a data misfit term with coefficient β penalizing the deviation between the u predictions and observations.
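A minimal sketch of the PICKLE objective as a nonlinear least-squares residual vector is given below. The CKLE mode matrices `psi_u` and `psi_y` (eigenfunctions scaled by the square roots of their eigenvalues), the residual function `res_fn`, and the omission of the q unknowns and regularization terms are simplifying assumptions of this sketch, not the full Equation 14.

```python
import numpy as np

def pickle_objective(params, Nu, psi_u, psi_y, u_mean, y_mean,
                     res_fn, Hu, us, beta):
    # Sketch of the PICKLE least-squares objective (Equation 14);
    # q and the regularization terms are omitted for brevity.
    xi_u, xi_y = params[:Nu], params[Nu:]
    u = u_mean + psi_u @ xi_u   # CKLE for u: conditional mean plus modes
    y = y_mean + psi_y @ xi_y   # CKLE for y, conditioned on y data
    r_pde = res_fn(u, y)                    # PDE residuals at the FV cells
    r_data = np.sqrt(beta) * (Hu @ u - us)  # u data misfit with weight beta
    return np.concatenate([r_pde, r_data])
```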
We use the Trust Region Reflective algorithm (Branch et al., 1999) to solve the minimization problems of both the PICKLE and MAP methods. To do this, we cast Equations 5 and 14 as nonlinear least-squares problems. The least-squares minimization algorithm requires evaluations of the Jacobian matrix J of the objective vector with respect to the parameters being estimated, which is the most computationally demanding part of the least-squares minimization. Jacobian evaluation in the PICKLE method only requires computing the derivatives of the PDE residuals with respect to the CKLE coefficients and does not require solving the BVP (Equations 1–3). On the other hand, evaluating the Jacobian in the MAP method requires solving multiple BVPs, either for computing a finite difference approximation or for evaluating the Jacobian via the chain rule as in Appendix B. Therefore, Jacobian evaluation is significantly cheaper for PICKLE than for MAP. In addition, the cost of Jacobian evaluation depends on the Jacobian matrix size: the number of Jacobian columns is N in MAP and Nu + Ny in PICKLE. Depending on the field smoothness and the required resolution, Nu and Ny can be chosen so that Nu + Ny ≪ N, leading to much smaller Jacobian matrices for PICKLE than for MAP. The details of the optimization algorithm are given in Appendix C.
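Reusing the objective sketch above, the SciPy Trust Region Reflective solver could be invoked as follows; supplying an analytical sparse Jacobian (omitted here, in which case SciPy falls back to a finite-difference approximation) is what makes the cost comparison above relevant.

```python
import numpy as np
from scipy.optimize import least_squares

# Usage sketch: assumes psi_u, psi_y, u_mean, y_mean, res_fn, Hu, us,
# beta, Nu, and Ny are defined as in the objective sketch above.
sol = least_squares(
    lambda p: pickle_objective(p, Nu, psi_u, psi_y, u_mean, y_mean,
                               res_fn, Hu, us, beta),
    x0=np.zeros(Nu + Ny),
    method="trf",   # Trust Region Reflective
)
xi_u_hat, xi_y_hat = sol.x[:Nu], sol.x[Nu:]
```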
3.2 Computing Covariance Functions
The evaluation of the mean and covariance of u from u measurements using parameterized covariance models and marginal likelihood maximization is not adequate for two related reasons: (1) the u(x) field is not stationary (i.e., the covariance kernel of u depends on the points x′ and x″ themselves and not just on the distance between them as in, e.g., the Matérn kernel), which limits our choice of covariance kernels; and (2) this purely data-driven approach does not enforce the governing equations and boundary conditions on the mean and covariance function, even approximately. Therefore, in PICKLE we employ a Monte Carlo (MC) simulation-based method for computing the conditional mean and covariance of u.
where ξ is a vector of independent and identically distributed Gaussian random variables. The eigenpairs satisfy the same eigenvalue problem (Equation 8) as those in the deterministic CKLE of Section 3.1. We note that the number of terms retained here does not need to be the same as Ny in Section 3.1; for example, it can be chosen with a smaller rtoly to obtain a more accurate MC solution. However, in this study we set the two equal. Next, we construct an ensemble of Nens realizations of y by sampling ξ(i) from the standard multivariate normal distribution and evaluating the CKLE model (Equation 19) with ξ = ξ(i).
In this work, we set Nens large enough to ensure that the PICKLE estimates of y do not change with a further increase of Nens. In general, for the MC estimate of the covariance of u to have at least Nu non-zero eigenvalues, the ensemble size must satisfy Nens > Nu (the sample covariance computed from Nens realizations has rank of at most Nens − 1). When it is not feasible to perform enough MC simulations, shrinkage estimators can be employed to regularize the covariance matrix estimation (Chen et al., 2009). The accuracy of the covariance estimation with a small number of MC simulations can be increased by performing additional, less expensive coarser-resolution simulations using the multilevel MC approach (Giles, 2015; X. Yang et al., 2019, 2021). Also, there are several computationally efficient alternatives to MC methods, including the moment equation method (e.g., Jarman & Tartakovsky, 2013; Neuman, 1993; D. M. Tartakovsky et al., 2003), polynomial-chaos-based approaches (Li & Tartakovsky, 2020; Lin & Tartakovsky, 2010; Tipireddy et al., 2020), surrogate models (X. Yang et al., 2018), and generative physics-informed machine learning methods (L. Yang et al., 2019).
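A minimal sketch of the MC estimator of the conditional mean and covariance of u is given below, assuming a hypothetical `sample_y` that draws a CKLE realization of y (Equation 19) and a `solve_bvp` that solves Equations 1–3 for that realization.

```python
import numpy as np

def mc_mean_cov_u(sample_y, solve_bvp, n_ens):
    # Draw n_ens transmissivity realizations from the CKLE, solve the
    # forward BVP for each, and form the sample mean and covariance of
    # the resulting hydraulic head fields.
    U = np.stack([solve_bvp(sample_y()) for _ in range(n_ens)])  # (n_ens, N)
    u_mean = U.mean(axis=0)
    C_u = np.cov(U, rowvar=False)  # rows = realizations, columns = FV cells
    return u_mean, C_u
```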
4 Hanford Site Case Study
We compare the performance of the PICKLE and MAP methods for parameter estimation in the steady-state two-dimensional groundwater model of the Hanford Site. The Hanford Site is a former nuclear production complex on the Columbia River in the U.S. state of Washington and is currently operated by the United States Department of Energy. It is the most contaminated nuclear site in the US and is the focus of many modeling (e.g., Cole et al., 2001) and remediation (e.g., Burger et al., 2020) studies. In this comparison study, we use two reference transmissivity fields y(x) = ln T(x) and boundary conditions uD(x) and q(x) that are based on the three-dimensional Hanford Site calibration study (Cole et al., 2001). This calibration study was performed on the unstructured quadrilateral grid shown in Figure 1a, with 4–17 horizontal layers depending on the Cartesian plane coordinates, and produced an estimate of the three-dimensional conductivity field. We obtain the first reference transmissivity field by depth averaging this conductivity field over the unstructured mesh.
The original lateral mesh contains three different mesh resolutions and includes both the western and eastern banks of the Columbia River (the "Columbia River" cells are highlighted in blue in Figure 1a). We simplify the mesh by removing the river cells, prescribing the Dirichlet BC on the western side of the Columbia River, and coarsening the mesh to achieve a uniform resolution, as shown in Figure 1b. For mesh coarsening, we merge groups of fine cells into a single coarse cell while ensuring that the resulting coarse mesh remains boundary-conforming and that the coarse cells are quadrilateral. The resulting mesh has 1475 cells. The transmissivity of each coarse cell is computed as the geometric average of the transmissivities of the fine cells, as illustrated in the sketch below. The transmissivity field corresponding to the coarse mesh is shown in Figure 2. We refer to this field as reference field 1 ("RF1"). We note that the PICKLE method can employ the FV (as in this study) or finite element discretization to evaluate the residuals and, therefore, can utilize a multiresolution mesh.
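The geometric averaging used in the coarsening step amounts to arithmetic averaging of y = ln T, as the short sketch below shows; the grouping of fine cells into a coarse cell is assumed given.

```python
import numpy as np

def coarse_transmissivity(T_fine):
    # Geometric average of the fine-cell transmissivities merged into one
    # coarse cell; equivalent to arithmetic averaging of y = ln T.
    return np.exp(np.mean(np.log(T_fine)))
```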
Figure 2 also shows the locations of some of the wells at the Hanford Site. We note that the calibration study (Cole et al., 2001) gives the coordinates of 558 wells at the Hanford Site, but some of these wells are located in the same coarse or fine cells. Because the FV discretization exclusively uses cell centers to denote spatial locations, multiple wells are treated as one measurement location if they fall within the same coarse cell. As a result, there are 323 wells in the FV model shown in Figure 2.
We hypothesize that the accuracy of the PICKLE method depends on the smoothness of the reference transmissivity field. To test this hypothesis, in Section 5.2 we generate the reference field 2 ("RF2") transmissivity field by using GPR (Equation 17) with 50 measurements drawn from the RF1 field at locations randomly picked from the well locations. By construction, the RF2 field is smoother than the RF1 field. In Section 5.2, we compare the performance of the PICKLE and MAP methods for the RF2 field. We also study the performance of PICKLE relative to MAP as a function of the FV model resolution. For this, we generate a higher-resolution mesh by splitting each cell in the mesh in Figure 1b (1× resolution) into four (4× resolution) equal-area cells, resulting in 5,900 cells. We note that there are 408 wells at this resolution.
The Dirichlet and Neumann boundaries ∂D_D and ∂D_N were defined in the Hanford Site calibration study and are shown in Figure 2. This calibration study also provides estimates of the head and fluxes at these boundaries. In setting boundary conditions for our comparison study, we assume that uD(x) is known (in Sections 5.1 and 5.2, uD(x) is given by the aforementioned calibration study, and in Section 5.4, uD(x) is modified from the calibrated values to simulate the changing water levels in the Columbia and Yakima Rivers). In the case study with unknown q(x), in the MC simulations we assume a normal distribution for q(x) with the mean and variance computed from the values in the calibration study. In the case study where we assume that q(x) is known, we take the values of q(x) from the calibration study.
For each reference log-transmissivity field yref(x) = ln Tref(x), we generate the reference hydraulic head field uref(x) by solving the Darcy flow equation on the corresponding mesh with Dirichlet and (deterministic) Neumann boundary conditions set as described above. Then, we randomly pick well locations and treat the values of yref at these locations as y measurements. We assume that u measurements are available at all well locations and the values of uref at these locations are taken as the u measurements. These synthetic data sets are used in the PICKLE and MAP methods to estimate the y(x) and u(x) fields.
All PICKLE and MAP simulations are performed using a 3.2 GHz 8-core Intel Xeon W CPU and 32 GB of 2666 MHz DDR4 RAM. The TPFA solver and the PICKLE and MAP methods are implemented in Python using the NumPy and SciPy packages. The weights in the PICKLE and MAP minimization problems are empirically found to minimize the error with respect to the reference y fields as β = 10, α = 10⁻⁴, and γ = 10⁻⁴. When a reference field is not known, these weights could be found using standard cross-validation methods (Picard & Cook, 1984).
5 Numerical Experiments
5.1 RF1 Reference Field
Therefore, in all cases considered in this section we are using the regularization given by Equation 13.
Figure 3 shows the distribution of point errors in the PICKLE and MAP estimates of y relative to the RF1 y field obtained with Nys = 25, 50, 100, and 200 y observations. For the considered measurement locations, the PICKLE and MAP methods have comparable accuracy for Nys = 50, 100, and 200, with MAP being more accurate for Nys = 25.
Because the inverse problem for y is ill-posed, the regularized PICKLE and MAP solutions depend not only on the number of measurements but also on the measurement locations. To study the effect of the measurement locations on the PICKLE and MAP estimation errors, for each value of Nys, we randomly generate 10 distributions of y measurement locations and estimate y for each of these distributions. Table 1a shows the ranges of relative ℓ2 and absolute ℓ∞ errors in the PICKLE and MAP y estimates, as well as the number of iterations of the minimization algorithm and the execution time (in seconds), for Nys ranging from 25 to 400. For comparison, we also show errors in y estimated via GPR (Equation 17). The ℓ∞ error is defined as the maximum over the cells xi ∈ Xc (i = 1, …, N) of the absolute difference between the estimated and reference y fields at xi.
(a) Unknown Neumann boundary conditions

| Metric | Solver | Nys = 25 | Nys = 50 | Nys = 100 | Nys = 200 | Nys = 400 |
| --- | --- | --- | --- | --- | --- | --- |
| Least-squares iterations | PICKLE | 93–357 | 17–64 | 14–30 | 13–23 | 18–51 |
| | MAP | 12–35 | 33–483 | 22–535 | 27–603 | 25–400 |
| Execution time (s) | PICKLE | 180.43–495.34 | 159.94–592.47 | 142.71–300.60 | 144.73–253.47 | 143.26–378.35 |
| | MAP | 402.08–1613.52 | 126.52–2134.91 | 70.40–2283.90 | 121.87–2737.72 | 116.74–1928.08 |
| Relative ℓ2 error | GPR | 0.175–0.244 | 0.147–0.180 | 0.119–0.156 | 0.095–0.114 | 0.069–0.083 |
| | PICKLE | 0.130–0.285 | 0.100–0.152 | 0.083–0.119 | 0.076–0.090 | 0.056–0.069 |
| | MAP | 0.095–0.130 | 0.085–0.100 | 0.076–0.576 | 0.067–0.296 | 0.052–0.062 |
| Absolute ℓ∞ error | GPR | 6.21–9.71 | 5.64–8.27 | 4.24–8.00 | 3.79–7.39 | 3.74–5.87 |
| | PICKLE | 4.61–7.57 | 4.52–6.40 | 4.36–6.10 | 4.17–6.57 | 3.68–5.16 |
| | MAP | 3.91–6.29 | 4.07–6.51 | 4.06–79.01 | 3.72–39.49 | 3.68–6.30 |

(b) Known Neumann boundary conditions

| Metric | Solver | Nys = 25 | Nys = 50 | Nys = 100 | Nys = 200 | Nys = 400 |
| --- | --- | --- | --- | --- | --- | --- |
| Least-squares iterations | PICKLE | 47–158 | 31–78 | 37–85 | 29–55 | 28–54 |
| | MAP | 7662–15694 | 5343–13247 | 3497–19048 | 1488–4143 | 1081–2626 |
| Execution time (s) | PICKLE | 155.06–510.83 | 120.67–305.00 | 137.37–324.25 | 119.43–216.47 | 119.43–216.47 |
| | MAP | 1544.52–3133.03 | 1072.23–2582.03 | 726.96–3649.97 | 320.33–836.32 | 232.30–545.90 |
| Relative ℓ2 error | PICKLE | 0.104–0.347 | 0.092–0.209 | 0.081–0.109 | 0.071–0.083 | 0.057–0.064 |
| | MAP | 0.092–0.107 | 0.083–0.100 | 0.074–0.085 | 0.064–0.071 | 0.050–0.069 |
| Absolute ℓ∞ error | PICKLE | 4.55–9.51 | 4.87–8.76 | 4.32–5.45 | 3.62–5.43 | 3.48–4.99 |
| | MAP | 5.37–6.57 | 4.91–6.39 | 3.64–6.40 | 3.65–6.45 | 3.15–5.55 |
As expected, the accuracy of both methods increases with Nys. The PICKLE method is on average slightly less accurate than MAP in terms of both ℓ2 and ℓ∞ errors. However, MAP is more sensitive to the measurement locations. For example, for Nys = 100 and 200, we observe that in MAP the maximum ℓ2 errors are 0.576 and 0.296, respectively, versus 0.119 and 0.090 in PICKLE. We attribute the higher robustness of PICKLE relative to MAP with respect to measurement locations to the regularization effect of the CKLE representation of y. We also note that GPR has significantly larger errors than those of PICKLE and MAP for all considered examples.
Table 1a also shows that the computational cost of PICKLE is significantly smaller than that of MAP, and the cost difference increases with increasing Nys. Note that we report the total execution time of PICKLE, which includes the cost of the MC evaluation of the mean and covariance of u (approximately 28 s), GPR (approximately 0.4 s), and the eigendecomposition (approximately 2.9 s). The computational cost of GPR for the considered problem is negligible relative to both PICKLE and MAP, and we do not show it in this table. As with the estimation errors, we observe that the computational cost of PICKLE is significantly less sensitive to the measurement locations than that of MAP. For example, the ratios between the PICKLE maximum and minimum execution times over 10 realizations for Nys = 25 and 400 are 2.74 and 2.64, respectively. In MAP, for the same values of Nys, these ratios are 4.01 and 16.51. The larger variability in the MAP computational time corresponds to the larger variability in the number of iterations of MAP's least-squares minimization algorithm.
Next, we investigate the performance of the PICKLE and MAP methods as functions of Nys when the Neumann boundary conditions are known. We note that the GPR approach for estimating y is based solely on y measurements and, therefore, is independent of the boundary conditions; accordingly, we do not present GPR errors in this comparison. Table 1b shows the errors and execution times of the PICKLE and MAP methods for the same sets of y measurements as in the unknown Neumann boundary condition cases. We find that the errors of both methods decrease only slightly (by less than 5%) relative to the unknown Neumann boundary condition cases. The execution time of PICKLE is practically unaffected by whether the Neumann boundary conditions are known, while the MAP execution time increases.
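For reference, the error metrics reported in Tables 1–4 can be computed as sketched below, assuming `y_hat` and `y_ref` are arrays of estimated and reference values at the FV cells and that the relative ℓ2 error is the ratio of the ℓ2 norm of the estimation error to that of the reference field, consistent with the definitions in Section 5.1.

```python
import numpy as np

def estimation_errors(y_hat, y_ref):
    # Relative l2 error and absolute l-infinity error over the FV cells.
    rel_l2 = np.linalg.norm(y_hat - y_ref) / np.linalg.norm(y_ref)
    abs_linf = np.max(np.abs(y_hat - y_ref))
    return rel_l2, abs_linf
```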
We hypothesize that increasing the number of KL terms in the CKLE of y should increase the accuracy of PICKLE because it allows capturing the spatial correlation structure of y(x) more accurately. However, increasing the number of KL terms also increases the number of unknown parameters and, therefore, the computational cost of PICKLE. In Tables 2a and 2b, we compare the errors and execution time of PICKLE with 1000 and 1400 terms in the y CKLE for the cases with unknown and known boundary conditions, respectively. We observe that, contrary to our hypothesis, increasing the number of KL terms does not lead to a significant increase in the accuracy of PICKLE. The ℓ2 error decreases slightly, with a significant (10%) improvement only for the smallest considered number of y measurements. This is because Ny = 1000 already corresponds to a very small value of rtoly = 6.4 × 10⁻⁶. A further increase in Ny does not significantly improve the approximation power of the CKLE but does render solving the minimization problem costlier. We observe a slight increase in the ℓ∞ errors because a larger number of KL terms might require a stronger regularization (i.e., larger values of β). On the other hand, the increase in Ny leads to a significant increase in the execution time of PICKLE, by approximately a factor of 4 for Nys = 50 and a factor of 2 for Nys = 100, 200, and 323.
(a) Unknown Neumann boundary conditions

| Metric | KL terms | Nys = 50 | Nys = 100 | Nys = 200 | Nys = 323 |
| --- | --- | --- | --- | --- | --- |
| Least-squares iterations | 1000 | 36 | 19 | 13 | 9 |
| | 1400 | 67 | 15 | 10 | 9 |
| Execution time (s) | 1000 | 317.18 | 166.63 | 115.46 | 81.59 |
| | 1400 | 1374.94 | 306.18 | 202.58 | 179.37 |
| Relative ℓ2 error | 1000 | 0.138 | 0.104 | 0.092 | 0.076 |
| | 1400 | 0.114 | 0.100 | 0.088 | 0.076 |
| Absolute ℓ∞ error | 1000 | 6.49 | 5.07 | 5.36 | 5.20 |
| | 1400 | 6.57 | 5.49 | 5.50 | 5.32 |

(b) Known Neumann boundary conditions

| Metric | KL terms | Nys = 50 | Nys = 100 | Nys = 200 | Nys = 323 |
| --- | --- | --- | --- | --- | --- |
| Least-squares iterations | 1000 | 23 | 18 | 13 | 9 |
| | 1400 | 42 | 12 | 11 | 9 |
| Execution time (s) | 1000 | 199.28 | 169.18 | 116.97 | 80.88 |
| | 1400 | 887.05 | 238.54 | 226.61 | 179.82 |
| Relative ℓ2 error | 1000 | 0.133 | 0.102 | 0.090 | 0.074 |
| | 1400 | 0.116 | 0.102 | 0.088 | 0.074 |
| Absolute ℓ∞ error | 1000 | 6.43 | 5.14 | 5.38 | 5.37 |
| | 1400 | 6.51 | 5.51 | 5.42 | 5.39 |
5.2 RF2 Reference Field
Here, we estimate y using the synthetic measurements of y and u generated on the coarse and fine meshes for the RF2 reference field. We assume that u measurements are available at all wells, that is, Nus = 323 and 408 on the coarse and fine meshes, respectively. As in Section 5.1, the number of KL terms in the y and u expansions is set to Ny = Nu = 1000. The corresponding relative tolerances for these choices of Nu and Ny are rtolu = 3.01 × 10⁻⁹ and rtoly = 7.9 × 10⁻⁶, respectively. In contrast to our results for the RF1 field, here we find that the regularization of Equation 12 provides more accurate results than the regularization of Equation 13. For 10 different spatial distributions of 50 observations of y, the relative ℓ2 errors in the estimated y field are in the ranges of 0.008–0.028 and 0.041–0.078 for the regularizers given by Equations 12 and 13, respectively. Therefore, in this section we use the regularization given by Equation 12.
Figure 4 shows the RF2 reference y field and the point errors in the PICKLE and MAP estimates of the y field on the coarse mesh obtained using Nys = 10, 25, 50, and 100 for the unknown Neumann boundary condition case. The locations of y measurements are randomly selected from the well locations. Table 3 lists the ranges of ℓ2 and ℓ∞ errors in the y estimates as functions of Nys obtained with the PICKLE, GPR, and MAP methods. For each Nys, 10 different random spatial distributions of measurement locations are selected to compute these ranges. Subtables (a) and (b) give results for unknown and known Neumann boundary conditions, respectively. PICKLE's ℓ2 errors are smaller than those of MAP for Nys = 50 and 100. For Nys = 10 and 25, the lower bounds of the ℓ2 errors are smaller for PICKLE and the upper bounds are smaller for MAP. The absolute ℓ∞ errors follow the same pattern as the ℓ2 errors.
(a) Unknown Neumann boundary conditions

| Metric | Solver | Nys = 10 | Nys = 25 | Nys = 50 | Nys = 100 |
| --- | --- | --- | --- | --- | --- |
| Least-squares iterations | PICKLE | 15–34 | 11–18 | 11–14 | 9–11 |
| | MAP | 51–156 | 14–87 | 14–62 | 14–144 |
| Execution time (s) | PICKLE | 63.5–110.2 | 85.3–138 | 81.5–105 | 83.4–114 |
| | MAP | 66.2–202.4 | 28.4–191 | 29.5–138 | 38.1–306 |
| Relative ℓ2 error | GPR | 0.089–0.165 | 0.069–0.100 | 0.040–0.078 | 0.023–0.034 |
| | PICKLE | 0.036–0.161 | 0.017–0.052 | 0.009–0.028 | 0.006–0.006 |
| | MAP | 0.038–0.070 | 0.029–0.044 | 0.021–0.032 | 0.017–0.021 |
| Absolute ℓ∞ error | GPR | 3.34–4.45 | 2.05–4.29 | 2.11–4.36 | 1.04–2.37 |
| | PICKLE | 1.16–3.61 | 0.831–1.51 | 0.792–1.20 | 0.781–0.854 |
| | MAP | 1.18–1.53 | 1.08–1.44 | 0.858–1.49 | 0.790–1.09 |

(b) Known Neumann boundary conditions

| Metric | Solver | Nys = 10 | Nys = 25 | Nys = 50 | Nys = 100 |
| --- | --- | --- | --- | --- | --- |
| Least-squares iterations | PICKLE | 14–30 | 11–17 | 11–14 | 9–11 |
| | MAP | 61–109 | 17–193 | 14–53 | 14–47 |
| Execution time (s) | PICKLE | 58.2–107 | 53.3–71.8 | 53.5–68.2 | 72.8–94 |
| | MAP | 66.6–126 | 72.1–122 | 66.4–121 | 39.5–326 |
| Relative ℓ2 error | PICKLE | 0.030–0.075 | 0.015–0.048 | 0.008–0.028 | 0.006–0.008 |
| | MAP | 0.035–0.056 | 0.028–0.039 | 0.020–0.030 | 0.017–0.020 |
| Absolute ℓ∞ error | PICKLE | 1.14–2.16 | 0.834–1.46 | 0.794–1.28 | 0.734–0.985 |
| | MAP | 1.02–1.48 | 1.02–1.45 | 0.873–1.47 | 0.800–1.08 |
For this coarse resolution, the execution time of PICKLE is larger than that of MAP. Both PICKLE and MAP perform well for unknown Neumann boundary conditions with estimation errors being only slightly larger than those in the case of known Neumann boundary conditions. The ℓ2 and ℓ∞ errors in estimating the RF2 field are significantly smaller than those in estimating the RF1 field, which is not surprising given the relative smoothness of the RF2 field. For the same reason, the execution times of both PICKLE and MAP methods are significantly smaller for modeling measurements from RF2 than RF1.
Next, we study the relative ℓ2 error in the PICKLE solution for y(x) as a function of Ny and Nu, the numbers of terms in the CKLEs of y and u, respectively, for a fixed number of y measurements. For simplicity, we set Ny = Nu. Figure 5 shows that the ℓ2 error decreases as Ny increases and, for the considered RF2 field, reaches an asymptotic value of less than 0.07 at Ny ≈ 800. Therefore, rtoly of the order of 10⁻⁶ (which is used in this work) and the corresponding Ny = 1000 are sufficient to obtain an accurate approximation of the RF2 y field and the corresponding reference u field. We note that for the (diffusion-type) Darcy equation, the solution u(x) is always smoother than the parameter field y(x). Therefore, the computational cost of PICKLE can be reduced by setting rtolu = rtoly, which for diffusion equations would result in Nu < Ny.
Finally, we test the relative performance of the PICKLE and MAP methods as a function of the resolution of the flow model by estimating y and u on the finer mesh with N = 5900. Table 4 lists the ranges of ℓ2 and ℓ∞ errors in the PICKLE, GPR, and MAP estimates of y as functions of Nys, as well as the execution times, obtained from 10 different random distributions of measurements for each value of Nys. At this resolution, PICKLE is more accurate than MAP for most considered configurations and numbers of measurements. Tables 4a and 4b show results for unknown and known boundary conditions, respectively. Figure 6 shows the RF2 y field at the resolution N = 5900 and the point errors of the PICKLE and MAP estimates of this y field obtained with Nys = 10, 25, 50, and 100 and unknown flux boundary conditions. It follows from Table 4 that the PICKLE ℓ2 errors are smaller than those of MAP except for the upper ranges of the errors for Nys = 10 and 25. The lower bound of the ℓ∞ errors is lower for the PICKLE method (except for Nys = 10), while the upper bound is larger (except for Nys = 100). The errors for unknown Neumann boundary conditions are slightly larger in both methods than those in the case with known Neumann boundary conditions.
(a) Unknown Neumann boundary conditions

| Metric | Solver | Nys = 10 | Nys = 25 | Nys = 50 | Nys = 100 |
| --- | --- | --- | --- | --- | --- |
| Least-squares iterations | PICKLE | 14–35 | 11–16 | 11–15 | 9–11 |
| | MAP | 196–288 | 102–234 | 78–199 | 69–205 |
| Execution time (s) | PICKLE | 208–290 | 188–203 | 196–217 | 206–265 |
| | MAP | 12,459–16,944 | 6,031–12,374 | 3,977–10,121 | 4,072–8,186 |
| Relative ℓ2 error | GPR | 0.089–0.165 | 0.068–0.098 | 0.043–0.077 | 0.020–0.032 |
| | PICKLE | 0.040–0.099 | 0.013–0.063 | 0.011–0.029 | 0.004–0.013 |
| | MAP | 0.041–0.065 | 0.035–0.048 | 0.030–0.037 | 0.023–0.027 |
| Absolute ℓ∞ error | GPR | 2.95–4.40 | 2.06–4.19 | 2.18–3.80 | 0.80–2.10 |
| | PICKLE | 1.70–2.99 | 0.715–2.32 | 0.476–2.04 | 0.348–0.834 |
| | MAP | 1.33–1.51 | 1.18–1.55 | 0.961–1.48 | 0.877–1.30 |

(b) Known Neumann boundary conditions

| Metric | Solver | Nys = 10 | Nys = 25 | Nys = 50 | Nys = 100 |
| --- | --- | --- | --- | --- | --- |
| Least-squares iterations | PICKLE | 14–37 | 11–15 | 11–15 | 8–11 |
| | MAP | 65–109 | 69–128 | 84–136 | 86–129 |
| Execution time (s) | PICKLE | 193–200 | 181–200 | 192–217 | 192–223 |
| | MAP | 6,096–9,333 | 6,277–10,348 | 4,228–7,124 | 4,520–6,330 |
| Relative ℓ2 error | PICKLE | 0.030–0.080 | 0.012–0.063 | 0.010–0.028 | 0.004–0.010 |
| | MAP | 0.036–0.057 | 0.029–0.041 | 0.024–0.033 | 0.020–0.025 |
| Absolute ℓ∞ error | PICKLE | 1.65–2.80 | 0.709–2.29 | 0.476–1.93 | 0.302–0.823 |
| | MAP | 1.33–1.60 | 1.06–1.69 | 0.900–1.62 | 0.922–1.37 |
5.3 Scaling of the Execution Time With the Problem Size
Comparing Tables 3 and 4, it can be seen that the execution times of both PICKLE and MAP increase with increasing mesh resolution; however, the execution time of PICKLE increases more slowly than that of MAP. To further study the dependence of the computational cost of PICKLE and MAP on resolution, in Figure 7 we plot the execution times of these methods as functions of N for both the RF1 and RF2 reference fields. An additional mesh with N = 23,600 FV cells is generated by dividing each cell in the mesh with N = 5900 into four cells. The number of y measurements is fixed across all these simulations. Figure 7 also shows that a power-law model fits the execution times of both methods. We note that for N = 23,600, the MAP method did not converge after running for two days. Therefore, the power-law relationships for the MAP method are obtained from the execution times for N = 1475 and 5900 and used to estimate MAP's execution time for the highest resolution.
From Figure 7, we see that the PICKLE and MAP execution times increase as N^1.1 and N^3.2, respectively, for both the RF1 and RF2 fields. The close-to-linear dependence of PICKLE's execution time on the problem size gives it a computational advantage over the MAP method.
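One way to obtain such scaling exponents is a linear fit in log-log space, as sketched below. The mesh sizes are those used in this study, but the execution times are placeholders, not the measured values, since the exact fitting procedure behind Figure 7 is not detailed here.

```python
import numpy as np

# Fit t = c * N**p by linear regression in log-log space.
N = np.array([1475, 5900, 23600])   # FV cells at 1x, 4x, and 16x resolution
t = np.array([150.0, 200.0, 320.0])  # placeholder execution times (s)
p, log_c = np.polyfit(np.log(N), np.log(t), 1)
print(f"execution time scales as N^{p:.2f}")
```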
5.4 Modeling u and y Measurements Corresponding to Varying Boundary Conditions
In many natural systems such as the Hanford Site, boundary conditions can change with time. Once "trained" for one set of boundary conditions, PICKLE can be used without additional retraining to model data corresponding to different boundary conditions. Specifically, the covariance kernel Cu(x′, x″) that is calculated from MC simulations for one set of boundary conditions can be used to estimate y and u using measurements that correspond to different boundary conditions. In Appendix D, we demonstrate that Cu(x′, x″) depends on the deterministic boundary conditions only through the gradient of the mean hydraulic head field. We also demonstrate that an ϵ change in the Dirichlet boundary conditions leads to an O(ϵ/L) change in the covariance, where L is the size of the domain in the direction of mean flow. Here, we assume that this change in the covariance can be disregarded, and we treat Cu(x′, x″) as independent of the Dirichlet boundary condition values.
As an example, we consider a case where the Dirichlet boundary condition incrementally changes with time over a given range in response to changes in the water levels in the Columbia and Yakima Rivers. We denote by uD(i) (i = 1, …, Nt) the Dirichlet BC at each of Nt discrete times. At the ith time, the u measurements are collected at a set of spatial locations. The Neumann boundary conditions are assumed to be unknown. The measurements ys are available at fixed locations (ys do not change in time). To model the data, the covariance function Cu(x′, x″) can be found using the MC method described in Section 3.2 with the Dirichlet BC set to any one member of the set {uD(i)}. We emphasize that Cu(x′, x″) needs to be computed for only one member of {uD(i)}. This covariance is then employed to solve the inverse problems for all members of {uD(i)}.
where the mean transmissivity is computed from the GPR estimate of y (Equation 17). Equation 22 is a mean-field equation, which disregards the term ⟨T′(x)∇u′(x)⟩, where T′(x) = T(x) − ⟨T(x)⟩ and u′(x) = u(x) − ⟨u(x)⟩.
To test this approach, we assume that the reference y field is given by the RF1 field, from which we draw measurements of y at random well locations. Furthermore, we assume that u(x) is sampled at a fixed set of locations and that measurements of u(x) gathered at three different times are available at each location, forming three vectors of u measurements us(i) (i = 1, 2, 3). In general, the locations of u measurements can change over time, but in this work we assume that the locations of u measurements are the same for all considered boundary condition values. The boundary conditions uD(i) (i = 1, 2, 3) are constructed as follows: uD(1) is given by the calibration study (Cole et al., 2001), and uD(2) and uD(3) are obtained by shifting uD(1) by prescribed amounts (in meters). Three reference fields u(i)(x) are computed by solving Equations 1–3 with y(x) given by the RF1 y field and subject to the Dirichlet BCs uD(i) (i = 1, 2, 3). The vector us(i) is drawn from each reference field u(i)(x).
We compute Cu(x′, x″) and the conditional mean of u from MCS for one of these boundary conditions. The mean u fields for the remaining boundary conditions are approximated by solving Equations 22–24. Figure 8 shows the PICKLE estimates and the corresponding point errors with respect to the reference fields u(i) (i = 1, 2, 3). For all three fields, the errors in the estimated u fields are similar, with average relative errors of less than 0.5% and maximum point errors of less than 4%. These results show that the PICKLE model trained for one boundary condition can be used to accurately predict the u field for the other boundary conditions. Note that, in general, the PICKLE estimate of y from the us(i) and ys measurements could depend on i because the parameter estimation is an ill-posed problem. However, for this setting we find that the PICKLE estimates of y are within 0.01% of each other.
6 Conclusions
- For the synthetic data generated from the RF1 and RF2 y fields, we demonstrated that the PICKLE and MAP execution times scale with mesh resolution as N^1.15 and N^3.27, respectively, where N is the number of FV cells. The close-to-linear dependence of PICKLE's execution time on the problem size gives PICKLE a computational advantage over the MAP method for large-scale problems. We consider this to be the main advantage of the PICKLE method
- For the same number of measurements, the accuracy of PICKLE and MAP depends on the measurement locations. The MAP method is more accurate for the RF1 field, and the PICKLE method is more accurate for the RF2 field for most considered cases
- The execution time of PICKLE and MAP increases and the accuracy decreases as the roughness of the parameter field increases. This is expected for most inversion methods for ill-posed problems that rely on a smoothness-enforcing regularization. Therefore, for smooth fields, the regularized inverse solutions are expected to be more accurate. For the same reason, iterative optimization methods for inverse problems are expected to converge faster for smooth fields
- In the PICKLE method, the execution time and accuracy increase with the increasing number of KL terms. In this work, as a baseline we used Ny = Nu = 1000, which corresponds to rtol < 10⁻⁶. We stipulate that this criterion is sufficient to obtain a convergent estimate of y with respect to the number of KL terms
- The training of the PICKLE model should be performed only for one value of the boundary conditions and does not need to be updated as the boundary conditions change, which significantly reduces its cost
- The accuracy of the PICKLE method depends on the ability of the truncated CKLEs to accurately approximate y and u, which requires a certain degree of smoothness of the considered fields. We demonstrated that for y and u fields that are representative of the Hanford Site, the CKLE approximations of the fields lead to results that are comparable in accuracy to the MAP method. However, CKLEs can also be used to approximate fields exhibiting step-like changes (e.g., at the boundaries of different geological formations) using a logistic function, as was shown in A. Tartakovsky et al. (2020)
- In the PICKLE method, computing the covariance function of u from MCS can become a computational bottleneck for large-scale problems. However, MCS can be replaced with more computationally efficient alternatives, including the multilevel MC method, generative physics-informed machine learning models, Polynomial Chaos and other surrogate models, and the moment equation method
Finally, it is important to note that the PICKLE method is not limited to steady-state problems. In fact, the extension of the PICKLE formulation (11) to time-dependent problems is straightforward and only requires replacing the evaluation of the residuals of the steady-state flow equation with those of the time-dependent flow equation. Constructing the CKLE (6) for a time-dependent solution field is the same as for one that is only a function of space. For a time-space-dependent problem, x in Equation 6 will be a vector of n spatial coordinates and time. The covariance of u, which is needed to compute the eigenfunctions and eigenvalues in Equation 6, will be a function of both space and time; however, it can be computed from the MC solution just as in the considered steady-state problem.
Acknowledgments
This research was partially supported by the U.S. Department of Energy (DOE) Advanced Scientific Computing Research (ASCR) program. Pacific Northwest National Laboratory is operated by Battelle for the DOE under Contract DE-AC05-76RL01830.
Appendix A: Finite Volume Discretization
Appendix B: Computing MAP Estimates
Appendix C: Solver Optimization
We implemented our solvers for both PICKLE and MAP in Python. For both methods, we employ the TPFA-FV scheme to discretize the forward problem. Although we did not parallelize the solvers used in this work, we optimized the codes in the several ways described below.
C1 Precomputing Matrices
Because the properties of each cell, the observation locations of us and ys, and the topology of the cell connections are fixed, the structure (i.e., the locations of non-zero entries) of the observation matrices Hu and Hy, the regularization matrix D, the stiffness matrix A in Equation 4, and the partial derivatives in the first block row of Equation B3 remain unchanged throughout the least-squares minimization of Equations 5 and 14. Thus, these fixed structures can be identified in advance, and only the values of A and of the partial derivatives in Equation B4 need to be updated at each minimization iteration. In addition, when the boundary conditions are known and constant in time, the aggregated contributions of the prescribed hydraulic head and normal flux to each FV cell i (the corresponding terms in Equation A2) can also be precomputed. For MAP, the second to fourth block rows and Hu in the first block row of the Jacobian in Equation B3 are also constant throughout the minimization because they depend only on the topology of the mesh. Therefore, these elements can also be precomputed ahead of time.
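The sketch below illustrates this precomputation idea for the stiffness matrix using a hypothetical 3-cell connectivity; only the numeric values stored in the fixed CSR structure are refreshed at each iteration.

```python
import numpy as np
import scipy.sparse as sp

# Build the sparsity pattern once from the fixed mesh topology, then
# refresh only the stored values during the least-squares minimization.
N = 3
rows = np.array([0, 0, 1, 1, 1, 2, 2])
cols = np.array([0, 1, 0, 1, 2, 1, 2])
A = sp.csr_matrix((np.zeros(rows.size), (rows, cols)), shape=(N, N))

def update_stiffness(A, vals):
    # vals must follow the ordering of A.data fixed by the CSR structure;
    # updating in place avoids re-assembling the index arrays.
    A.data[:] = vals
    return A
```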
C2 Sparsity
Sparsity is maintained throughout the evaluations of the objective functions of both PICKLE and MAP, including the residual l(u, y, q) in Equation 4, as well as of their corresponding Jacobian matrices. This significantly reduces the storage and computational overhead because each 4× refinement of the mesh quadruples the matrix dimensions. However, the SciPy implementation of the sparse linear solver (spsolve) does not support sparse right-hand-side vectors and matrices. Furthermore, partial solves that compute the solution only at the measurement locations, and reuse of the sparse structural reordering across solves, are not supported by the package. Future optimization using these techniques would further reduce the execution times of both the MAP and PICKLE methods.
Appendix D: Perturbative Expression for the Head Covariance
In this section, we derive closed-form perturbative approximations to the covariance of the hydraulic head by treating the transmissivity field as a random field. Let ⟨·⟩ denote ensemble averaging. We assume that the transmissivity field is written as its ensemble average ⟨T⟩(x) plus a zero-mean random fluctuation T′(x), that is, T(x) = ⟨T⟩(x) + T′(x) with ⟨T′(x)⟩ = 0. Similarly, the hydraulic head is decomposed into its ensemble mean and a zero-mean deviation, that is, u(x) = ⟨u⟩(x) + u′(x) with ⟨u′(x)⟩ = 0.
In Equation D6, A(x, z) depends on the homogeneous Dirichlet and Neumann boundary conditions, while ⟨u⟩(x) depends on the heterogeneous Dirichlet and Neumann boundary conditions that describe the actual boundary conditions of the modeled system. Therefore, to compute the covariance Cu for any heterogeneous boundary conditions, one only needs to solve the deterministic Equation D8 subject to these heterogeneous boundary conditions. The expensive part of computing Cu is evaluating A(x, z), which needs to be done only once, for the homogeneous boundary conditions.
Evaluating A(x, z) from Equation D5 requires computing the Green's function, which involves solving the deterministic problem (Equations D2–D4) N times (this number can be reduced if the symmetry G(y, x) = G(x, y) is taken into account), where N is the number of nodes in the discretization of the computational domain. In this work, we employ a different strategy. We compute Cu for one set of heterogeneous boundary conditions (uD and q) corresponding to the average state of the modeled system using the Monte Carlo simulation method described in Section 3.2. Next, we perform the following order-of-magnitude analysis to describe the change in Cu due to an ϵ change in uD:
which states that an ϵ change in uD leads to an O(ϵ/L) change in the covariance of the hydraulic head. At the Hanford Site, L is on the order of 10⁴ m and ϵ (the natural variation in the Dirichlet boundary conditions reflecting water level changes in the Columbia River) is on the order of 1 m. Therefore, for the Hanford Site, the natural variations in the Dirichlet boundary condition lead to O(10⁻⁴) changes in Cu. We assume that this sufficiently small change can be disregarded and that the covariance Cu, which is computed for one value of the Dirichlet boundary condition, can be used to model u data corresponding to other values of this boundary condition.
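For concreteness, the order-of-magnitude estimate for the Hanford Site can be written out as follows:

```latex
% An \epsilon change in u_D perturbs C_u by O(\epsilon/L); at the Hanford
% Site this gives
\[
  \frac{\epsilon}{L} \sim \frac{1\,\mathrm{m}}{10^{4}\,\mathrm{m}} = 10^{-4},
\]
% i.e., about one part in ten thousand, which justifies reusing C_u
% across Dirichlet boundary condition values.
```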
Open Research
Data Availability Statement
The data and codes used in this paper are available at https://zenodo.org/record/6512404#.YnA5hPPMKBR.