# Hydrogeological Modeling and Water Resources Management: Improving the Link Between Data, Prediction, and Decision Making

## Abstract

A risk-based decision-making mechanism capable of accounting for uncertainty regarding local conditions is crucial to water resources management, regulation, and policy making. Despite the great potential of hydrogeological models in supporting water resources decisions, challenges remain due to the many sources of uncertainty, as well as the difficulty of making and communicating decisions mindful of this uncertainty. This paper presents a framework that utilizes statistical hypothesis testing and an integrated approach to the planning of site characterization, modeling prediction, and decision making. Benefits of this framework include aggregated uncertainty quantification and risk evaluation, simplified communication of risk between stakeholders, and improved defensibility of decisions. The framework acknowledges that obtaining absolute certainty in decision making is impossible; rather, the framework provides a systematic way to make decisions in light of uncertainty and determine the amount of information required. In this manner, quantitative evaluation of a field campaign design is possible *before* data are collected, beginning from any knowledge state, which can be updated as more information becomes available. We discuss the limitations of this approach in terms of the types of uncertainty it can recognize and make suggestions for addressing the remainder. This paper presents the framework in general and then demonstrates its application in a synthetic case study. Results indicate that the effectiveness of field campaigns depends not only on the environmental performance metric being predicted but also on the threshold value in decision-making processes. The findings also demonstrate that improved parameter estimation does not necessarily lead to better decision making, thus reemphasizing the need for goal-oriented characterization.

## Key Points

- We present the risk-based data acquisition design evaluation (RDADE) framework to integrate stochastic analyses with defensible decisions
- We find that improved parameter estimation does not always guarantee lower decision risk
- When dealing with extreme events with high consequences, it is advantageous to improve system resiliency in addition to modeling accuracy

## 1 Introduction

Providing water in sufficient quantity and quality to meet growing demands for agricultural, industrial, and municipal uses is an increasingly complex challenge, requiring water resources managers to make difficult decisions regarding the selection, maintenance, and treatment of water sources in light of increasing scarcity and prevalence of contamination, in both surface water and groundwater. In the United States, groundwater provides drinking water for nearly 150 million people, yet over 20% of the nation's groundwater samples have had at least one contaminant present at levels potentially harmful to human health (DeSimone et al., 2014).

### 1.1 Environmental Performance Metrics and Water Resources Management

Sustainable groundwater management requires managers to make decisions based on answers to crucial questions regarding the quantity and quality of groundwater resources. For example, a water district manager needs to make decisions on when and where to divert water to storage facilities, so that the district has water that is suitably clean (i.e., that all contaminant concentrations are below the limit of the treatment system) and sufficiently ample (i.e., that there is an acceptable low risk of failure to supply the fluctuating domestic water demand). In these cases, the hydrological/hydrogeological variable(s) involved in these types of questions may be the arrival time of a contaminant plume at a water intake, the groundwater and contaminant flux passing through a specific area over a given period, or the contaminant concentration at a specific location or time. Such variables have been referred to as environmental performance metrics (EPMs; De Barros et al., 2012), the prediction of which helps water managers answer the aforementioned questions.

### 1.2 Field Data, Uncertainty, and Risk

Answering the above types of questions with complete certainty is impossible in practice due to the many challenges in site characterization, modeling, and decision making. Farber and Findley (2010) acknowledged the inevitability of this uncertainty, writing, “In many situations, safety cannot be absolute but must entail an acceptable level of risk, however and by whomever that level may be defined” (pp. 172–173). USEPA (2014) also described the importance of acknowledging uncertainty in management decisions, stating that “if uncertainty and variability have not been well characterized or acknowledged, potential complications arise in the process of decision-making” (p. 6).

Uncertainty could arise from several sources. Among all types of uncertainty, perhaps the one most commonly addressed and studied concerns the uncertainty of the parameters in the model(s) applied to estimate the EPMs, and how uncertainty in these parameters creates uncertainty in EPM predictions. Typically, these uncertainties are reduced by characterizing the site by acquiring field data and then using the data in inverse modeling to estimate parameters and the uncertainties in the estimates. Substantial research has developed methodologies for these processes and thus is not the focus of this paper, but more information can be found in, for example, Hubbard and Rubin (2000), Kowalsky et al. (2004), Chen et al. (2004), Hou and Rubin (2005), Rubin et al. (2010), and Savoy, Heße, et al. (2017).

Due to the inherent assumptions and the conceptual setup of each model, it is also important to consider uncertainty in conceptual models and assumptions, as mentioned by, for example, Refsgaard et al. (2007). A more detailed discussion of the sources and types of uncertainty, along with how they can be addressed, is provided in section 5.

Given these uncertainties, two key questions arise:

- How much uncertainty is acceptable?
- How does one obtain the necessary information for such an acceptable level of uncertainty?

The answer to the first question is nontrivial and ideally would be determined by considering costs (i.e., how costly are data and analyses) and consequences (i.e., what is at stake) with input from multiple stakeholders including regulators, site owners, site users, and the general public. Thus, the answer to question 1 is beyond the scope of this paper. The goal of this paper is to provide a framework by which the second question can be answered.

The answer to the second question is closely related to the spatiotemporal scale of interest (e.g., whether we are interested in the instantaneous peak contaminant concentration or the total contaminant mass over an area or period); the physical response being modeled (e.g., whether we are interested in the spread of the contaminant plume or the residual concentration after the main plume has been carried away by advection); and the efficacy of the sampling campaign where the data are obtained.

### 1.3 Previous Work and Remaining Challenges

The efficacy of a sampling campaign can be hard to define due to the complex process by which the sampled data are used to condition the conceptual, statistical, and/or spatiotemporal variability models, all of which is a precursor to making EPM predictions and, ultimately, decision making. Abellan and Noetinger (2010) demonstrated a method for optimizing subsurface field campaign designs. However, the objective by which “optimal” was defined was inferring the most accurate geostatistical model (Figure 1, arrow A) and not necessarily the most accurate EPM predictions or most successful decision making. It is this added complexity that highlights the importance of a goal-oriented site characterization (e.g., De Barros & Rubin, 2008; De Barros et al., 2012; Nowak et al., 2010; Savoy, Kalbacher, et al., 2017).

There is often no direct interplay between hydrogeological considerations and other considerations related to management, regulation, and the general public. However, in some applications, such as when assessing a health risk to a potentially exposed population, hydrogeological characterization and modeling play only one part in the overall risk assessment (De Barros & Rubin, 2008; Maxwell et al., 1999; Rubin et al., 2018). It is thus important to adopt a goal-oriented perspective (Figure 1, arrows B, C, and D), where considerations regarding all aspects revolve around the key management variable—the risk of making a wrong decision—which, in turn, shape the sampling campaign design. The conceptual difference between goal orientation and parameter orientation can be exemplified by the different definitions of what a “good” campaign design entails. For instance, Abellan and Noetinger (2010) presented a method of optimizing field campaigns based on the associated information gain, defined as the Kullback-Leibler divergence of the posterior distribution with respect to the prior distribution of the geostatistical parameters. The concept here is to maximize the information gain in order to have a better understanding of the distribution of the parameters of interest. On the other hand, Nowak et al. (2012) integrated optimization with hypothesis testing, where optimality was defined by the lowest decision risk at a fixed cost of a sampling campaign or, inversely, the lowest cost to achieve an acceptably low risk.

However, a challenge exists when applying the approach presented by Nowak et al. (2012). As illustrated in Figure 1, the main feature of goal-oriented design is the inclusion of considerations from management, regulation, and the general public. Under the framework proposed by Nowak et al. (2012), any constraints or considerations must be quantified and codified into an optimization algorithm, which can be challenging or impossible, due to the elaborate relationship that exists between industry, society, government, and the general public (Brulle & Pellow, 2006; Rubin et al., 2018). After arriving at the optimal design, it must then be checked for conformity to all the applicable noncodifiable constraints and considerations on an ex post facto basis. To that end, it is likely that some quantitatively suboptimal designs are in fact qualitatively “better,” due to greater compliance with noncodifiable rules. This motivates the use of a proposal-evaluation approach, discussed in the following subsection, as opposed to an automated optimization algorithm. In addition to codifiability, challenges remain related to the subjectivity of some considerations as described above. For example, it is difficult to decide whether to save $10,000 on sampling for a 1% risk increase—there is no “correct” answer, but rather a complex interplay between regulations, socioeconomics, politics, and hydrogeology as mentioned before. This is why we consider the first question above to be outside the scope of this paper.

### 1.4 Present Contribution

To address the aforementioned challenges, this paper presents the risk-based data acquisition design evaluation (RDADE) framework. RDADE facilitates communication concerning uncertainty and risk between multiple stakeholders, including managers, regulators, and the general public in addition to hydrogeologists (see Figure 1). The idea is that when hydrogeologists communicate forecasts, uncertainty, and decisions to other stakeholders, the focus should be on the operational decisions along with the consequence and probability of an erroneous decision. Thus, RDADE provides a mechanism to summarize the uncertainty from all steps, including site characterization, inverse modeling, and forward modeling. In this way, the focus of the communications can revolve around environmental risk and making operable decisions based on a balance between consequence of erroneous decisions and the probability of such errors.

Rather than relying on an automated optimization algorithm, RDADE adopts a proposal-evaluation approach, in which a set of candidate designs is proposed and then evaluated. This approach has several benefits:

- It allows for adherence to any number of constraints and considerations, regardless of codifiability and quantifiability.
- It reduces computational effort, since the objective function is evaluated only on a few feasible proposals, rather than numerous times throughout the execution of an optimization algorithm.
- It provides decision makers a list of agreed-upon and feasible alternative designs, which thus allows for the transparent treatment of subjective considerations and weighing costs of increased certainty versus acceptable probability of erroneous decisions.
- It allows more flexible treatment of epistemic uncertainty, especially in nontechnical components of decision making (see section 5 for a detailed discussion).

There are many scenarios where the RDADE framework could be applied. Examples include determining whether or not a migrating contaminant plume will reach nearby water supplies, determining whether contaminant concentration will exceed the Maximum Contaminant Level (MCL, the highest concentration of chemicals permitted in drinking water systems), determining how large a well protection zone should be, or designing flood control structures (e.g., Bolster et al., 2009, 2013; Enzenhoefer et al., 2012; Frind et al., 2006). This paper presents the RDADE framework and demonstrates its use in a synthetic case study predicting contaminant arrival time at a water intake. Since the focus of the paper is demonstration of RDADE, the case study is relatively simple in order to allow conciseness and clarity.

The remainder of this paper is structured as follows. Section 2 presents the theoretical framework for RDADE. Section 3 provides an overview of a synthetic case study. Section 4 presents our findings and offers a discussion of the case study. Section 5 provides a detailed discussion of the various sources of uncertainty that can be accounted for by RDADE and offers suggestions on how to account for the others. Lastly, section 6 provides the summary and conclusions of this work.

## 2 Theoretical Framework: RDADE

This section presents the details of the RDADE framework, which involves two nested levels of hypothesis testing. In the first level, hypothesis testing is used to make a decision about the relevant environmental system, using all available data. In the second level, we randomize the environmental system in order to probabilistically evaluate the effectiveness of a proposed data acquisition design.

### 2.1 First Level: Conditional Decision Making

Consider an environmental performance metric, *EPM*, and its critical value (*EPM*_{critical}). The “critical range” is the range of EPM values that would pose a problematic or dangerous condition (e.g., a concentration above a regulatory threshold or arrival before degradation). With this, we define an environmental risk indicator variable as follows:

$$I = \begin{cases} 1, & EPM \in \text{critical range} \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

The two competing hypotheses are

$$H_0: I = 1 \tag{2}$$

$$H_1: I = 0 \tag{3}$$

indicating that the risk indicator *I* is the subject of these hypotheses. The binary nature of the hypotheses and the decision making leads to the typical confusion matrix, which consists of four possibilities based on both what is actually true and what we determine to be true, that is, the decision made. Note that with this setup, the burden of proof is on *H*_{1}; namely, the desirable, risk-free scenario must be supported by convincing evidence before it is accepted. This aligns with safe water resources management: If the safety of our water supply is in question, we remain suspicious until we can reliably demonstrate that it is in fact safe.

A decision is then made at a prescribed level of significance (*α*): *H*_{0} is rejected in favor of *H*_{1} only if the probability of the risky scenario, conditional on all available information, falls below *α*.

It is worth noting that *α* is not determined by any engineering calculation or modeling prediction. Rather, *α* is determined by regulation or policy to strike the proper balance between acceptable levels of uncertainty and characterization costs. In other words, if the consequences of a certain type of error are catastrophic, a very low value of *α* would be used. In this way, even if such a catastrophic event were to happen, the decision makers can demonstrate adherence to well-defined probabilistic criteria throughout the process. Referring again to Figure 1, this can be represented by arrows B, C, and D. While not all stakeholders are equipped for detailed discussions of model parameterizations, all stakeholders can participate in discussions weighing costs and benefits related to characterization and risk and determining what levels of certainty are acceptable and at what cost. In an ideal scenario, these probabilistic criteria would be defined by regulation with input by all stakeholders (Rubin et al., 2018).

Suppose we have a set of field observations (*g*), which may be a combination of field-scale tests, laboratory tests, or other remote sensing information. The conditional probability of *I* given the data is denoted as *Pr*[*I*=1|*g*]=⟨*I*|*g*⟩. Depending on the *EPM* under consideration, the amount of information available about the site, and the type of measurements contained in *g*, the computation of ⟨*I*|*g*⟩ can involve several steps. These steps can include consideration of alternative conceptual models, inference of the geostatistical parameters of the relevant variables (e.g., hydraulic conductivity), and simulation of physical processes. An informed decision can then be made, represented by the decision indicator variable, *D*^{g}, shown as follows:

$$D^{g} = \begin{cases} 0, & \langle I \mid g\rangle < \alpha \\ 1, & \text{otherwise} \end{cases} \tag{4}$$
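The first-level decision rule can be sketched in a few lines of Python. This is an illustrative sketch only: `epm_samples` stands in for a hypothetical Monte Carlo ensemble of conditional EPM predictions given *g*, and the critical range is assumed to be values above a threshold; none of these names come from the paper.

```python
import numpy as np

def decide(epm_samples, epm_critical, alpha=0.05):
    """First-level hypothesis test (illustrative sketch).

    epm_samples : conditional Monte Carlo ensemble of the EPM given data g
    epm_critical: assumed threshold; values above it are "critical"
    Returns (p_risky, D_g): p_risky approximates <I|g> = Pr[I=1|g];
    D_g = 0 (reject H0, declare safe) only if p_risky < alpha.
    """
    I = (np.asarray(epm_samples) > epm_critical).astype(int)  # eq. (1) per sample
    p_risky = I.mean()                                        # <I|g>
    D_g = 0 if p_risky < alpha else 1                         # decision indicator
    return p_risky, D_g

# Usage: a toy ensemble in which very few predictions exceed the threshold
rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=0.5, size=10_000)
p, D = decide(samples, epm_critical=2.5, alpha=0.05)
```

Here the estimated probability of the risky scenario is far below *α*, so *H*_{0} is rejected and the site is declared safe (*D*^{g} = 0).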

### 2.2 Second Level: Probabilistic Sampling Campaign

To evaluate a sampling campaign *before* the observations are taken, we ascend to the second level of hypothesis testing by treating sampling campaigns probabilistically. Suppose that there are *N*_{G} proposed sampling campaign designs that are scientifically sound and practically feasible, each of which is unique, with the corresponding observations modeled as a random variable, *G*_{j}; *j*=1,…,*N*_{G}. If the *j*th design is adopted, the collected data are denoted by *g*_{j}, which also represents a realization of *G*_{j} in accordance with conventional random variable notation. To test the efficacy of any design, we define the following two error indicator variables:

$$E_{\alpha}^{g_j} = \begin{cases} 1, & D^{g_j} = 0 \text{ and } I = 1 \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

$$E_{\beta}^{g_j} = \begin{cases} 1, & D^{g_j} = 1 \text{ and } I = 0 \\ 0, & \text{otherwise} \end{cases} \tag{6}$$

where *E*_{α}^{g_j} indicates the occurrence of a Type I error (i.e., erroneously rejecting *H*_{0}) given significance level *α*; the same holds for *E*_{β}^{g_j} and a Type II error (i.e., erroneously failing to reject *H*_{0}). A schematic diagram is provided in Figure 2, which summarizes the hypotheses, the conditional probabilities, and all the indicator variables in a conventional confusion matrix format.

The decision risk associated with design *G*_{j} is defined as the weighted probability of decision error,

$$R(G_j) = w_{\alpha}\Pr[E_{\alpha}^{G_j} = 1] + w_{\beta}\Pr[E_{\beta}^{G_j} = 1] \tag{7}$$

where *w*_{α} and *w*_{β} are weight coefficients selected to quantify the relative significance of Type I and Type II errors, respectively. The second-level hypotheses are then

$$H_0^{G_j}: R(G_j) > R_{crit}; \qquad H_1^{G_j}: R(G_j) \le R_{crit} \tag{8}$$

where *R*_{crit} is the maximum allowable decision risk (which, like *α*, can be determined by regulation or policy) and the superscript of the hypotheses indicates that these are the hypotheses regarding the field campaign design, *G*_{j}. In other words, *H*_{0}^{G_j} is the second-level null hypothesis indicating that *G*_{j} is insufficient to ensure an appropriate test of *H*_{0}, on which our water resources management decision depends. The alternative hypothesis, *H*_{1}^{G_j}, indicates that the design *G*_{j} is indeed adequate to enable defensible decision making. Note that in equations 5 and 6, lowercase *g*_{j} is used because they consider a set of data from a single field, while in equations 7 and 8 uppercase *G*_{j} is used because they consider the observations probabilistically.

In addition to the subjectivity in *α* and *R*_{crit}, here we further emphasize the subjectivity in determining the weights *w*_{α} and *w*_{β}. In other words, we now ask not only what probability of error is acceptable but also what probability of *each type* of error is acceptable. While the obvious interest is keeping water supplies safe, which would motivate a high weight to *w*_{α}, this comes at a cost of being overly conservative, which may have other negative effects. For instance, in some cases a site user (e.g., a farm or factory) may be required to unnecessarily decrease production if *w*_{α} is too high and *w*_{β} is too low, which could cause detrimental economic impacts.
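As a concrete sketch of how the weighted decision risk could be evaluated from simulation output (array names and shapes are our own assumptions, not the paper's notation):

```python
import numpy as np

def decision_risk(E_alpha, E_beta, w_alpha=1.0, w_beta=0.0):
    """Weighted decision risk for one campaign design (illustrative sketch).

    E_alpha, E_beta : arrays of 0/1 error indicators, one entry per
                      simulated baseline field (Type I and Type II errors).
    w_alpha, w_beta : weights expressing the relative significance of the
                      two error types.
    """
    p_alpha = np.mean(E_alpha)  # estimated probability of a Type I error
    p_beta = np.mean(E_beta)    # estimated probability of a Type II error
    return w_alpha * p_alpha + w_beta * p_beta

# Usage: 3 Type I errors and 10 Type II errors out of 100 simulated fields;
# with w_beta = 0, only the Type I errors contribute to the risk
E_a = np.array([1] * 3 + [0] * 97)
E_b = np.array([1] * 10 + [0] * 90)
risk = decision_risk(E_a, E_b, w_alpha=1.0, w_beta=0.0)
```

Raising *w*_{β} above zero would penalize overly conservative designs as well, reflecting the trade-off discussed above.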

### 2.3 Implementation of RDADE: Simulation-Based Hypothesis Testing

In this subsection, we present a simulation-driven method to practically implement RDADE. The necessity of simulation lies in the fact that, in reality, the EPM(s) are unknown and, hence, we cannot determine the environmental risk indicator variable in the first place (equation 1). The simulation-based hypothesis testing method presented here allows us to implement RDADE while accounting for the uncertainties rooted in the fact that the EPM(s), as well as the models and parameters needed to estimate them, are unknown.

Given that RDADE allows for some subjectivity based on any regulatory or managerial considerations, for demonstration purposes, here we assume a case where falsely assuming the safety of water supplies is more consequential than being overly conservative. Hence, *w*_{α} would take a value of 1, while *w*_{β} would take a value of 0. In addition, we also assume that *R*_{crit}=*α* for simplicity.

The implementation of RDADE starts with generating an ensemble of *N*_{Y} baseline fields, denoted by *Y*_{i}^{b}; *i*=1,…,*N*_{Y}, where *N*_{Y} is selected to be sufficiently large to represent the entire range of physically plausible possibilities of the site of interest. Each *Y*_{i}^{b} is a field of all parameters necessary to compute the EPM(s). For example, for a groundwater flow/transport modeling problem the ensemble could include spatially variable hydraulic conductivity, porosity, and geochemical parameters. The random generation of baseline fields can be done using any method deemed appropriate for the parameters of interest and their spatial structures. The baseline fields can be generated conditional to any knowledge state prior to the sampling campaign; in a Bayesian context, this knowledge state is represented by the prior information on which the prior distribution is based. RDADE, in general, allows for the possibility of competing conceptual models, which may necessitate the generation of multiple ensembles of baseline fields.

Each baseline field *Y*_{i}^{b} is used for two purposes. The first is the computation of the baseline EPM (*EPM*_{i}). Depending on the application, computing the EPM may involve any number of models or transfer functions (e.g., hydrological, geochemical, and biological). In other words, *EPM*_{i} represents the value of the EPM that would occur if *Y*_{i}^{b} were a true representation of the site under consideration. By comparing *EPM*_{i} with *EPM*_{critical} and following equation 1, we can define the environmental risk indicator variable for each baseline field, *I*_{i}^{b}. The addition of the subscript and the superscript for *I*_{i}^{b} indicates that it is the environmental risk indicator variable if *Y*_{i}^{b} were a true representation of the site in question.

The second use of *Y*_{i}^{b} is to simulate the field campaign and the resulting decision making. This involves simulating the collection of data, *g*_{ij}, where the subscripts indicate the adoption of sampling campaign design *G*_{j} and the assumption of *Y*_{i}^{b} being true. In simple cases where the quantity being measured by *G*_{j} is a component of the field *Y*_{i}^{b} (e.g., hydraulic conductivity), the information in *g*_{ij} is merely that quantity from the locations in the field specified by the measurements. If the quantity being measured is not a direct component of *Y*_{i}^{b}, then some numerical simulations may be necessary to compute *g*_{ij}.

After the simulation of *g*_{ij}, we move on to simulate the decision making that would result. In most cases, this involves many steps, including inferring parameters, distinguishing between conceptual models, and forward modeling. RDADE can be applied to any combination of conceptual models, as well as any form of inverse modeling for the model or parameter inference. If there are multiple competing site conceptual models and/or forward models, Bayesian model averaging can be used to simulate a model-averaged EPM, which provides a model-averaged estimate of decision risk. The details of the inverse and forward modeling processes, however, are specific to each application and thus are not the focus of the present study.
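One crude way to realize such model averaging is to pool the conditional EPM ensembles of competing models in proportion to their posterior model probabilities. The sketch below assumes exactly that setup; the function and variable names are hypothetical, and the pooling-by-resampling scheme is one of several valid choices.

```python
import numpy as np

def model_averaged_epm(epm_ensembles, model_probs, rng=None):
    """Pool conditional EPM ensembles from competing conceptual models
    (illustrative Bayesian-model-averaging sketch).

    epm_ensembles : list of 1-D arrays, one conditional EPM ensemble per model
    model_probs   : posterior model probabilities (assumed to sum to 1)
    Draws from each model's ensemble in proportion to its posterior weight.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = min(len(e) for e in epm_ensembles)
    counts = np.round(np.asarray(model_probs) * n).astype(int)
    pooled = np.concatenate([
        rng.choice(np.asarray(ens), size=c, replace=True)
        for ens, c in zip(epm_ensembles, counts)
    ])
    return pooled

# Usage: two competing models with posterior weights 0.7 / 0.3
rng = np.random.default_rng(1)
m1 = rng.normal(10.0, 1.0, 1000)
m2 = rng.normal(12.0, 1.0, 1000)
pooled = model_averaged_epm([m1, m2], [0.7, 0.3], rng)
```

The pooled ensemble can then be thresholded against *EPM*_{critical} exactly like a single-model ensemble, yielding a model-averaged estimate of ⟨*I*|*g*⟩.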

The simulated data, *g*_{ij}, serve a dual role in RDADE: (1) as data for the inference of the geostatistical parameters and (2) as conditioning points in the forward modeling. In most cases, this involves simulating an ensemble of conditional fields, *Y*_{ij}^{c}, and executing the relevant forward model(s) to obtain the EPM predictions. Note that the subscripts *i* and *j* denote that the field is conditioned on *g*_{ij}, and the superscript *c* indicates that this is a conditional field, not a baseline field. With the ensemble of conditional EPM predictions (derived from *Y*_{ij}^{c}), we then obtain an ensemble of conditional environmental risk indicator variables, *I*_{ij}^{c}, determined by whether the EPM is within the critical range. The probability that the null hypothesis is true, conditional to the simulated field data, is found by the following equation:

$$\langle I \mid g_{ij}\rangle = \Pr[H_0 \mid g_{ij}] = \frac{1}{N_c}\sum_{k=1}^{N_c} I_{ijk}^{c} \tag{9}$$

where *N*_{c} is the number of conditional realizations. Afterward, by following equation 4, we can determine the decision indicator variable, *D*^{g_{ij}}, which represents the decision we would have made, assuming that *Y*_{i}^{b} was the real baseline field and that the only information we had was *g*_{ij}.

With *D*^{g_{ij}} and *I*_{i}^{b} in hand, we can test whether *G*_{j} is a successful campaign design. This is done by first determining the probabilities of the errors by averaging over the baseline fields:

$$\Pr[E_{\alpha}^{G_j} = 1] = \frac{\sum_{i=1}^{N_Y} E_{\alpha}^{g_{ij}}}{\sum_{i=1}^{N_Y} I_i^{b}} \tag{11}$$

with the analogous expression holding for the Type II error. The resulting decision risk *R*(*G*_{j}) (equation 7) is then compared with *R*_{crit} to test the second-level hypotheses (equations 12 and 13). Rejecting *H*_{0}^{G_j} indicates that *G*_{j} is sufficient for testing the hypothesis (note that there is no index of *i* here because we have averaged over the baseline fields), while the failure to reject indicates the opposite. The entire implementation process is graphically summarized in a flowchart (Figure 3).
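The loop just described can be sketched end to end. This is a schematic under strong simplifying assumptions: `conditional_epm_fn` is a hypothetical placeholder for the whole inverse-plus-forward modeling chain, the critical range is values above a threshold, and we take *w*_{α}=1, *w*_{β}=0 as in the demonstration below, so only Type I errors are tallied.

```python
import numpy as np

def evaluate_design(baseline_epms, conditional_epm_fn, epm_critical,
                    alpha=0.05, r_crit=0.05):
    """Second-level test of one campaign design (illustrative sketch).

    baseline_epms      : EPM_i for each baseline field (1-D array)
    conditional_epm_fn : callable i -> conditional EPM ensemble given the
                         data simulated on baseline field i (placeholder
                         for inverse + forward modeling)
    Returns (p_type1, adequate): estimated conditional Type I error
    probability and whether the design is deemed adequate, assuming
    w_alpha = 1 and w_beta = 0.
    """
    I_base = (np.asarray(baseline_epms) > epm_critical).astype(int)
    errors = []
    for i, risky in enumerate(I_base):
        if not risky:
            continue                      # Type I errors require I_i = 1
        cond = conditional_epm_fn(i)      # conditional EPM ensemble
        p_risky = np.mean(cond > epm_critical)
        D = 0 if p_risky < alpha else 1   # first-level decision, eq. (4)
        errors.append(1 if D == 0 else 0) # declared safe, but actually risky
    p_type1 = float(np.mean(errors)) if errors else 0.0
    return p_type1, p_type1 <= r_crit

# Usage with a toy stand-in for the modeling chain: the "posterior" is the
# baseline value plus noise, so errors on truly risky fields are rare
rng = np.random.default_rng(2)
base = rng.normal(1.0, 1.0, 200)
cond_fn = lambda i: base[i] + rng.normal(0.0, 0.3, 500)
p1, ok = evaluate_design(base, cond_fn, epm_critical=2.0)
```

In a real application, `conditional_epm_fn` would encapsulate simulating *g*_{ij}, parameter inference, conditional field generation, and forward modeling.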

In addition to the conditional error probabilities, the error *occurrence* probabilities, *P*_{α} and *P*_{β}, are considered. These are defined as follows:

$$P_{\alpha} = \Pr[E_{\alpha}^{G_j} = 1]\,\langle I\rangle \tag{14}$$

$$P_{\beta} = \Pr[E_{\beta}^{G_j} = 1]\,(1 - \langle I\rangle) \tag{15}$$

where ⟨*I*⟩ is the expected value of *I* determined from the baseline fields:

$$\langle I\rangle = \frac{1}{N_Y}\sum_{i=1}^{N_Y} I_i^{b}$$

While the conditional error probability (Pr[*E*_{α}^{G_j}=1]) is usually the focus in classical hypothesis testing, in water resources management it makes sense to focus on the error occurrence probability (*P*_{α}). This is because in some cases, it may be practically impossible to predict an event of extremely low probability (e.g., a very early arrival time). In this scenario, the probability of occurrence (⟨*I*⟩) would be very low, but the conditional error probability would be very high due to its conditional nature. In other words, if the risk-posing event has an exceptionally low probability of occurrence, no amount of field data could enable managers to predict this event, which would prevent any course of action from being acceptable, as indicated by the conditional probabilities. This effect is demonstrated in section 4.
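A numerical illustration of this distinction, with hypothetical values chosen by us rather than taken from the case study:

```python
# Hypothetical numbers: an extreme event with very low occurrence probability
mean_I = 0.001             # <I>: probability the risky event occurs at all
p_cond = 0.90              # Pr[Type I error | event occurs]: near-impossible to predict
p_alpha = p_cond * mean_I  # occurrence probability of a Type I error

# The conditional error probability looks alarming (90%), yet a Type I
# error actually occurs in only 0.09% of cases, which may be acceptable.
```

Judging the design by the conditional probability alone would condemn every feasible campaign, whereas the occurrence probability keeps the decision risk in proportion to how often the event can happen at all.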

## 3 Framework Application: Synthetic Case Study

The synthetic case study considers the arrival time, *τ*, of a contaminant plume at a control plane defining a location of interest. This control plane could represent an environmentally sensitive area or a water supply well. The risky scenario would arise if the contaminant plume arrived before a critical amount of time, *τ*_{crit}, passed. This scenario could arise in many real-world applications, such as when evaluating locations for waste disposal sites or when assessing the risk posed by a plume to a nearby supply well. Since early arrivals are of concern, the indicator variable is

$$I = \begin{cases} 1, & \tau < \tau_{crit} \\ 0, & \text{otherwise} \end{cases} \tag{17}$$

The level of significance *α* is 0.05.

### 3.1 Statistical and Physical Setup

Aquifer flow was simulated in a 2-D planar (*x*,*y*) rectangular domain, with constant head boundary conditions along the two boundaries parallel to the *y* axis, and no-flow conditions along the boundaries parallel to the *x* axis. The flow was uniform in the average in the positive *x* direction. The porosity, *n*=0.10, was assumed to be known and homogeneous. The total flow domain was 50 numerical grid cells in both dimensions, with the grid cell size set as a fraction of the integral scale (see below). We assumed the contaminant to have originated from an instantaneous point release, where the time and location of the release were also assumed to be known. The point release was located 40 grid cells upstream of the control plane.

Steady-state groundwater flow was simulated, with the velocity field given by Darcy's law,

$$v(\mathbf{x}) = -\frac{K(\mathbf{x})}{n}\nabla H(\mathbf{x})$$

where *K*(**x**) is hydraulic conductivity, *H*(**x**) is the hydraulic head, *v*(**x**) is water velocity, and *n* is porosity. The unconditional contaminant arrival time was computed with the methods of Schlather et al. (2017) and Pollock (1988). The conditional contaminant arrival times were computed with the method of Rubin (1991).

A set of field measurements was taken to characterize the natural logarithm of hydraulic conductivity, *Y*=ln(*K*), which was modeled as a space random function (SRF) with a stationary multivariate Gaussian distribution and exponential covariance. The measurements were used to estimate the parameters $\theta = (\mu_Y, \sigma_Y^2, I_Y)$, where *μ*_{Y}, $\sigma_Y^2$, and *I*_{Y} represent the mean, variance, and integral scale, respectively. The measurements were also used as conditioning points in the forward modeling.

To investigate the effect of prior information, the case study was executed under three alternative scenarios regarding prior distributions. In Scenario 1, the SRF parameters *θ* were assumed to be deterministically known. In this scenario, no inference of *θ* was necessary, and the measurements served only as conditioning points. In Scenarios 2 and 3, all three parameters were assumed to be distributed uniformly and independently of each other. In Scenario 2, *μ*_{Y} was distributed uniformly in the range [−6,−5]. In Scenario 3, *μ*_{Y} was distributed uniformly in the range [−7,−4]. In both Scenarios 2 and 3, $\sigma_Y^2$ and *I*_{Y} were distributed uniformly in the ranges [0.1,1] and [3,6], respectively. The three prior information scenarios were chosen to provide a range of knowledge states, from deterministic knowledge of SRF parameters (Scenario 1) to probabilistic descriptions of SRF parameters (Scenarios 2 and 3). Scenarios 2 and 3 were selected to represent relatively informative and relatively uninformative knowledge states about the SRF for *Y*, specifically different levels of variability in the mean value. This allowed a closer examination of the relationship between parametric uncertainty, travel time prediction, and decision-making accuracy.

### 3.2 Field Campaign Setup

The field campaigns to be tested were designed with 4, 8, 16, and 32 measurements. For each number of measurements, two alternative designs were tested. One configuration had measurements spread throughout the domain, covering various lag distances with the idea of improving the SRF parameter estimates. The other configuration had all measurements located in the likely area of the travel path. The likely area of the travel path was determined by simulating an ensemble of particle paths conditioned only on the prior information, and lateral displacement was plotted against the distance from the point source. These results are shown in Figure 4. Additionally, the locations of the measurements for all field campaign designs *G*_{j}, *j*=1,…,8 are shown in Figure 5.

### 3.3 Monte Carlo Methodology

In accounting for uncertainty in *θ*, Latin hypercube integration was used in order to reduce the computational burden (Rubin, 2003). The ensemble sizes were selected such that resulting distributions demonstrated stability with respect to ensemble size. For Scenarios 2 and 3, 27 and 81 hypercubes were used, respectively. For each hypercube, traditional Monte Carlo sampling was used to simulate 250 baseline fields from the distribution *f*(*Y*|*θ*), providing a total of *N*_{Y}= 250, 6,750, and 20,250 for Scenarios 1, 2, and 3, respectively. The methods described in the following subsections were implemented for each of the three scenarios.
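A minimal Latin hypercube sketch for the three SRF parameters is shown below, using the Scenario 2 ranges; the implementation details (one point per stratum, independently shuffled across dimensions) are a standard construction rather than the specific one used in the study.

```python
import numpy as np

def latin_hypercube(n, bounds, rng):
    """Draw n Latin hypercube samples over the given parameter bounds.

    bounds : list of (low, high) tuples, one per parameter dimension.
    Each range is split into n equal strata; one point is drawn per
    stratum, and strata are shuffled independently per dimension.
    """
    d = len(bounds)
    # stratified uniforms in [0, 1): row i falls in stratum i
    u = (rng.random((n, d)) + np.arange(n)[:, None]) / n
    for k in range(d):
        u[:, k] = rng.permutation(u[:, k])  # decouple strata across dimensions
    lo = np.array([b[0] for b in bounds], dtype=float)
    hi = np.array([b[1] for b in bounds], dtype=float)
    return lo + u * (hi - lo)

# Usage: 27 stratified samples of (mu_Y, sigma_Y^2, I_Y) as in Scenario 2
rng = np.random.default_rng(4)
theta_samples = latin_hypercube(27, [(-6.0, -5.0), (0.1, 1.0), (3.0, 6.0)], rng)
```

Each of the 27 hypercubes then receives its own Monte Carlo ensemble of baseline fields, as described above.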

#### 3.3.1 Baseline Fields and Simulated Field Campaigns

In each scenario, the travel time was computed deterministically for every baseline field, yielding the baseline risk indicator *I*_{i}^{b} via equation 17. After recording the deterministically known travel time, the field campaign was simulated by recording the values of hydraulic conductivity at the locations specified by the field campaign design. After the simulated field data *g*_{ij} were collected, the measurements were used to compute the *maximum a posteriori* estimate (*θ*_{MAP}) of the geostatistical parameters, similar to the maximum likelihood method presented by Kitanidis and Lane (1985), but with bounds provided by the prior distributions. After *θ*_{MAP} was computed, the conditional distribution of travel time *f*(*τ*^{c}|*θ*_{MAP},*g*_{ij}) was computed using semianalytical particle tracking (Rubin, 1991).
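A bounded MAP estimate of this kind can be sketched with a brute-force grid search over the prior support (with uniform priors, MAP coincides with maximum likelihood restricted to the bounds). The Gaussian likelihood below assumes the exponential covariance model of section 3.1; grid search replaces the gradient-based optimization a production code would use, and all names are illustrative.

```python
import numpy as np

def neg_log_lik(theta, coords, y):
    """Negative Gaussian log-likelihood of measurements y at coords,
    under a stationary exponential-covariance model (constants dropped)."""
    mu, var, iy = theta
    h = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    C = var * np.exp(-h / iy) + 1e-8 * np.eye(len(y))  # covariance + jitter
    r = y - mu
    sign, logdet = np.linalg.slogdet(C)
    return 0.5 * (logdet + r @ np.linalg.solve(C, r))

def theta_map(coords, y, bounds, n_grid=8):
    """Bounded MAP estimate by brute-force grid search over the prior
    support; adequate for a 3-parameter demonstration."""
    grids = [np.linspace(lo, hi, n_grid) for lo, hi in bounds]
    best, best_nll = None, np.inf
    for mu in grids[0]:
        for var in grids[1]:
            for iy in grids[2]:
                nll = neg_log_lik((mu, var, iy), coords, y)
                if nll < best_nll:
                    best, best_nll = np.array([mu, var, iy]), nll
    return best

# Usage: synthetic measurements at 12 random locations
rng = np.random.default_rng(5)
coords = rng.uniform(0.0, 10.0, size=(12, 2))
y = rng.normal(-5.5, 0.5, size=12)
est = theta_map(coords, y, bounds=[(-6.0, -5.0), (0.1, 1.0), (3.0, 6.0)])
```

The bounds play the role of the uniform priors: the estimate can never leave the prior support, mirroring the bounded maximum likelihood approach described above.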

#### 3.3.2 Simulated Decision Making

For each field campaign *G*_{j}, *N*_{c}=250×*N*_{Y} realizations of *τ*^{c} were computed, which led to *N*_{c} realizations of *I*^{c} via equation 17, and, in turn, ⟨*I*^{c}⟩ was computed using equation 9. From this, *H*_{0} was either accepted or rejected via equation 4. Finally, *H*_{0,G} could be tested based on equations 12 and 13, as a final judgment on the adequacy of the campaign design *G*_{j}. This entire process was then repeated for all eight field campaigns, beginning with the *N*_{c}=250×*N*_{Y} realizations of *τ*^{c}.
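The accept/reject bookkeeping can be mimicked with the short sketch below. The arrival time distributions and the decision threshold `p_dec` are stand-ins (the actual rule is given by equation 4, not reproduced here); the point is the flow from conditional realizations to the occurrence and conditional error probabilities.

```python
import numpy as np

rng = np.random.default_rng(2)
n_fields, n_cond = 1000, 250
tau_crit, alpha, p_dec = 1.0, 0.05, 0.5

# synthetic stand-ins: one "true" arrival time per baseline field and an
# ensemble of conditional realizations scattered around it
tau_true = rng.lognormal(mean=0.0, sigma=0.4, size=n_fields)
tau_cond = tau_true[:, None] * rng.lognormal(0.0, 0.2, (n_fields, n_cond))

I_true = tau_true <= tau_crit                  # binary outcome (cf. equation 17)
p_early = (tau_cond <= tau_crit).mean(axis=1)  # conditional probability per field
reject_H0 = p_early < p_dec                    # assumed stand-in for equation 4

type1 = I_true & reject_H0                     # H0 true but rejected
P_alpha = type1.mean()                         # occurrence probability of Type I error
P_alpha_cond = P_alpha / I_true.mean()         # conditional on H0 being true
adequate = P_alpha_cond <= alpha               # adequacy judgment (cf. equations 12-13)
print(P_alpha, P_alpha_cond, adequate)
```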

## 4 Case Study Results and Discussion

This section presents the results of the case study described in section 3. The results focus on the effects of the critical value of travel time, measurement configurations, and parametric uncertainty, both *a priori* and *a posteriori*. The results from the case study are shown in Figures 6-9. For the reasons described above, we chose to focus on the behavior of *P*_{α} (equations 5 and 11) and *P*_{α|H0} (equation 14), though analogous conclusions could have been made regarding *P*_{β} and *P*_{β|H1} (equations 6 and 15, respectively) as well.

Figures 6a, 6c, and 6e show the resulting values of ⟨*I*⟩ (equation 9), *P*_{α}, and *P*_{α|H0} plotted against *τ*_{crit} for a single measurement configuration (*G*_{1A}) and all three prior information scenarios. The vertical axis here is on a log scale in order to focus attention on the low-probability regions. Recalling the definition of *I* given by equation 17, we see that ⟨*I*⟩ as a function of *τ*_{crit} coincides with the cumulative distribution function of *τ*. Recalling the definition of *P*_{α|H0} (equation 14), we see that it is the quotient of the other two quantities. Figures 6b, 6d, and 6f show the standard deviation of these variables plotted against *τ*_{crit}. Figure 7 shows all eight measurement configurations for comparison but focuses only on *P*_{α|H0} for clarity. The left-hand column shows results from the four campaigns with measurements focused along the travel path, while the right-hand column shows the campaigns with the spread-out measurements, aligning with the two columns in Figure 5. The horizontal line denoted *α* indicates the significance level of 0.05. Values of *P*_{α|H0} exceeding this level indicate that the measurement campaign would be rejected via equation 12, deeming *G* inadequate. In both Figures 6 and 7, *τ*_{crit} is nondimensionalized by the travel path length *L* and expected value of velocity ⟨*v*⟩. Figures 8 and 9 show the root mean square error (RMSE) of the estimates for SRF parameters *μ*_{Y}, *σ*_{Y}, and *I*_{Y} for all eight measurement campaigns for Scenarios 2 and 3.

### 4.1 Effect of *τ*_{crit}

The first thing we notice is that, regardless of the quantity or spatial configuration of measurements, *P*_{α} and *P*_{α|H0} were highly sensitive to *τ*_{crit}. In turn, whether or not *H*_{0,G} could be rejected, and *G* thereby deemed adequate, depended on *τ*_{crit}. For both large and small values of *τ*_{crit}, we found *P*_{α} approaching zero for all measurement configurations, indicating a very low occurrence of a Type I error in these regions. To understand why this was the case, we examined the value of ⟨*I*⟩ for these regions, shown in Figure 6. Recalling that ⟨*I*⟩ as a function of *τ*_{crit} corresponds directly to the cumulative distribution function of the arrival time, a Type I error was very unlikely for relatively small values of *τ*_{crit} due to the low probability of *H*_{0} being true in this region. On the other hand, for relatively large values of *τ*_{crit}, a Type I error was unlikely because it was easier to predict the relatively likely event of the arrival time being smaller than the large *τ*_{crit}. Where a Type I error was more likely, then, was where *H*_{0} was somewhat likely to be true but more difficult to predict correctly: the intermediate portion of the cumulative distribution function of the arrival time.

Recalling the definition in equation 14, the behavior of *P*_{α} and ⟨*I*⟩ with varying *τ*_{crit} explains the behavior of *P*_{α|H0}. Figure 6 shows the relationship between ⟨*I*⟩, *P*_{α}, and *P*_{α|H0} graphically and also highlights the difference between using the occurrence probability and the conditional probability to describe the effectiveness of the field data.
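Assuming, as the discussion above implies, that *H*_{0} corresponds to the event *τ* ≤ *τ*_{crit}, the quotient relation can be written out explicitly:

$$
P_{\alpha|H_0} \;=\; \Pr(\text{Type I error} \mid H_0\ \text{true}) \;=\; \frac{\Pr(\text{Type I error} \,\cap\, \tau \le \tau_{crit})}{\Pr(\tau \le \tau_{crit})} \;=\; \frac{P_{\alpha}}{\langle I \rangle}.
$$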

In the case when *τ*_{crit} was relatively small, *H*_{0} was very unlikely to be true, and if it was true, it became increasingly difficult to predict correctly, regardless of the amount of data. Thus, *P*_{α|H0} could be considered a more informative metric to assess the benefit of the data than *P*_{α}. The relationship between these two variables and the probability of *H*_{0} being true in this case is shown near the vertical axes in Figure 6.

As demonstrated, the effectiveness of a field campaign and the success of decision making are highly dependent on the value of *τ*_{crit}. The practical implication of this result is that for effective, goal-oriented characterization, the field campaign should be tailored not only to the EPM of concern but also to the critical value of this EPM on which decision making depends.

### 4.2 Effect of Measurement Configurations

To compare the overall effectiveness of measurement campaigns, we examine their performance both in inferring SRF parameters (Figure 1, arrow A) and in enabling successful decision making, that is, their resulting *P*_{α|H0} (Figure 1, arrow B). Figures 8 and 9 show the RMSE of the SRF parameter estimates for all baseline fields and all measurement configurations. We find, unsurprisingly, that the parameter estimates improve with the increasing quantity of measurements. For any given number of measurements, the error in the parameter estimates was smaller for the measurements spread throughout the domain than for the measurements focused along the travel path.

As mentioned before, the focus of goal-oriented design is the probability of an incorrect decision. To this end, we use *P*_{α|H0} for comparison, shown in Figure 7. Focusing for the moment on the prior information of Scenario 2, we notice two clear patterns: (1) larger quantities of measurements were adequate (*H*_{0,G} is rejected) for a greater range of *τ*_{crit}, and (2) given a specific quantity of measurements, the configuration with measurements focused along the travel path (A) outperformed the configuration with measurements spread throughout the domain, despite the poorer performance in estimating the SRF parameters.

While summarizing a field campaign design in terms of a possible rejection of *H*_{0,G} is a useful tool for managers and practitioners, a more thorough description of the performance of a campaign can be provided by analyzing *P*_{α|H0}, which indicates the probability that a Type I error will occur and not just its relation to a threshold probability. For Scenario 1, there was little difference between the behavior of the different configurations, stemming from the unrealistic assumption that the SRF parameters were known deterministically. For the relatively uninformative prior information (Scenario 3), we found the same patterns as for Scenario 2: an increasing quantity of measurements improved performance, and the measurements focused along the travel path were better for predicting earlier arrivals, despite the poorer performance in estimating the SRF parameters.

Again, the practical implications of these results emphasize the need for goal-oriented characterization design. Designing field campaign strategies to optimize performance when estimating SRF parameters is clearly not the best approach, as it was shown that improved parameter estimates do not necessarily indicate improved decision-making performance. In addition to designing field campaigns tailored to predicting a specified EPM (e.g., arrival time), it is also necessary to take into consideration the critical value of this EPM, that is, the threshold on which decision making depends. Relating again to Figure 1, these results demonstrate that focusing only on arrow A can hinder arrow B, which is important when considering the goal of successful environmental and water resources management.

### 4.3 Effect of Parametric Uncertainty

To explore the effect of parametric uncertainty, we compare across scenarios the values presented in Figures 6 and 7. Initially, it seems counterintuitive that the more informative priors resulted in higher probabilities of Type I errors, an effect explored in detail in this subsection.

Consider a baseline field where a relatively early arrival occurred, although not too early to predict (i.e., roughly in the range 0.5≤*τ*⟨*v*⟩/*L*≤0.9). In this case, field measurements with above-average conductivity are more likely to be observed. With greater prior parametric uncertainty, this early arrival becomes more likely to be predicted, due to the increased influence of the measurements. In Scenario 1, a group of high-conductivity measurements will cause the conditional cumulative distribution function (CDF) of the arrival time to diverge only slightly from the unconditional CDF. This, in turn, decreases the chance of correctly predicting the early arrival, even when the data suggest it. On the contrary, in Scenarios 2 and 3, the high-conductivity measurements exert a greater impact on the conditional CDFs, enabling us to predict the early arrival when the data suggest so.

On the other hand, we can consider a baseline field where the arrival time is comparable to or greater than *L*/⟨*v*⟩. In this case, the arrival time is captured well by the mean behavior of the arrival time distribution, which can be estimated better with an informative prior. This is why, with increasing *τ*_{crit}⟨*v*⟩/*L*, we observed decreasing *P*_{α|H0} in all the scenarios, with the severity of the decrease inversely correlated with the informativeness of the prior.
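The mechanism can be mimicked with a one-parameter conjugate update: the same above-average data move the estimate much further under a weak prior than under an informative one. The numbers are purely illustrative.

```python
import numpy as np

def posterior_mean(prior_mu, prior_var, data, noise_var):
    """Conjugate normal-normal update for an unknown mean with known noise."""
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    return post_var * (prior_mu / prior_var + data.sum() / noise_var)

high_data = np.array([1.2, 1.5, 1.1, 1.4])          # above-average measurements
tight = posterior_mean(0.0, 0.05, high_data, 0.5)   # informative prior (cf. Scenario 1)
vague = posterior_mean(0.0, 5.0, high_data, 0.5)    # weak prior (cf. Scenario 3)
print(tight, vague)  # the weak prior lets the data pull the estimate further
```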

The effects described in this subsection should be considered specific to our case study, in which relatively simple geostatistical and physical models were used. Future research could investigate how this effect may change when more sophisticated spatial models, inversion methods, and forward models are used.

### 4.4 Uncertainty in the Results

To get a sense of the uncertainty in the results, we can look at the standard deviations of *P*_{α}, ⟨*I*⟩, and *P*_{α|H0}, as shown on the right-hand side of Figure 6. What we see is that the standard deviations of both *P*_{α} and ⟨*I*⟩ were maximal near the mean arrival time (i.e., when *τ*_{crit}⟨*v*⟩/*L* was near one). The reason for this is that the majority of the actual arrival times were near this value, making it more difficult to predict the binary outcome given by equation 17. On the other hand, the standard deviation of *P*_{α|H0} was at its peak when *τ*_{crit}⟨*v*⟩/*L*≈0.5, meaning that the conditional error probability was most uncertain when *τ*_{crit} was about half of the arrival time's expected value. The practical implication of this result is that even before the simulation is executed, *τ*_{crit}, ⟨*v*⟩, and *L* can provide a rough idea of how difficult the predictions might be. If *τ*_{crit}⟨*v*⟩/*L* is close to zero, then it can be reasoned that *H*_{0} is unlikely to be true; however, if it is true, it will be difficult to detect. Conversely, if *τ*_{crit}⟨*v*⟩/*L* is much greater than one, it can be reasoned that *H*_{0} is likely to be true and that it will be easy to detect.

## 5 On the Limits of Modeling Uncertainty

This paper provides a methodology for making decisions under conditions of uncertainty. The inputs required consist primarily of statistical models, in the form of probability density functions (pdfs), used to represent uncertainty. As “uncertainty” is a somewhat ambiguous term, it is important that we define the various types of uncertainty that can, and should, be addressed using RDADE, as well as the types that require a different approach. In this way, we delineate RDADE's usability.

The two major categories of uncertainty are aleatory uncertainty and epistemic uncertainty. Extensive discussions on these two categories can be found in Beven (2016) and Blöschl et al. (2013). In addition, somewhat more suggestive definitions of uncertainty are provided by Di Baldassarre et al. (2016) and Rubin et al. (2018), who discuss differences between “known unknowns” and “unknown unknowns,” standing for aleatory uncertainty and epistemic uncertainty, respectively. Further, in Di Baldassarre et al. (2016), the authors included a third category—“wrong assumptions”—meaning “things we think we know but we actually do not know.” This third category obviously overlaps, to some degree, with the other two, but it is important to mention it as we map the universe of uncertainty.

What is covered by RDADE? Simply stated, any element of the system under investigation that can be formulated as a statistical model can be analyzed with RDADE. Uusitalo et al. (2015) provide a detailed list of such elements, including inherent randomness, measurement error, and natural variations, which all fall nicely under the “known unknown” category. As also stated in Uusitalo et al. (2015), “Uncertainty about the cause-and-effect relationship, is often very difficult to characterize.” Thus, RDADE covers situations of known unknowns. However, RDADE can do more, as it also works with unknown unknowns. Presumably, there would be no pdf for unknown unknowns. However, this situation can partly be addressed by developing alternative conceptual models for any of the system elements, which can then be integrated into RDADE using Bayesian model averaging (BMA; cf. Hoeting et al., 1999). GLUE (Beven & Binley, 1992) could also be considered in this context. Thus, instead of a single model used to compute the statistics (e.g., as needed in equations 4-9 or 12), we can use multiple conceptual models.
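A minimal sketch of model averaging in this spirit is shown below, assuming two hypothetical fixed-parameter models for arrival time; in practice, each model's parameters would be integrated out when computing its marginal likelihood.

```python
import numpy as np
from scipy.stats import norm

# two hypothetical conceptual models for arrival time, expressed as pdfs
models = [norm(loc=1.0, scale=0.3), norm(loc=0.7, scale=0.5)]
data = np.array([0.9, 1.1, 0.8])   # observations used to weight the models

# posterior model probabilities: prior weights times (marginal) likelihoods
prior_w = np.array([0.5, 0.5])
likelihoods = np.array([m.pdf(data).prod() for m in models])
post_w = prior_w * likelihoods
post_w /= post_w.sum()

def bma_cdf(x):
    """Model-averaged predictive CDF, usable wherever a single-model
    distribution would otherwise enter the analysis."""
    return sum(w * m.cdf(x) for w, m in zip(post_w, models))

print(post_w, bma_cdf(1.0))
```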

There is a substantial body of evidence suggesting that BMA provides better-than-average predictive capabilities (Hoeting et al., 1999; Madigan & Raftery, 1994; Madigan et al., 1996). What is particularly appealing in this context is the consideration of “wild cards” (cf. Wardekker et al., 2010) by formulating them as conceptual models. Bringing multiple, alternative conceptual models into the RDADE analysis, including so-called “wild ones,” should not be interpreted to mean that the “true” model is necessarily one of a given class of alternatives (Leamer, 1983). Nonetheless, as shown by Leamer (pp. 315-316), such an approach “… asymptotically produce[s] an estimated density closest to the true density,” which is what is required for RDADE's application.

As mentioned above, the third category that Di Baldassarre et al. (2016) discussed is wrong assumptions. To qualify for RDADE, we must ask whether we can place a pdf on (i.e., associate a probabilistic model with) “wrong assumptions.” To some degree, the answer is yes; we are able to accomplish this by using BMA, in combination with the multiple or wild card scenarios described above. Ideally, this approach would reduce the impact of dominant, “wrong assumptions”-based conceptual models, for example, by exploring ex situ information such as geologically similar sites (cf. Chang & Rubin, 2019; Cucchi et al., 2019; Li et al., 2018). However, as stated in Uusitalo et al. (2015), “… caution is advised … the [alternative conceptual] model parameters have been fitted to predict the observed data well, so depending on which kinds of scenarios are tested, it may only be natural that the model predictions are relatively similar.”

Hence, the “unknown unknowns” situation (right or wrong assumptions) cannot be completely and assuredly addressed, because it covers situations that never happened or managed to escape the radar screens, so to speak; there is always a chance of a low-probability, hard-to-predict event, wilder than any wild card. Such low-probability events could produce either minor consequences or huge ones, the latter often referred to as “Black Swans,” following Taleb (2007): unexpected, large-scale disasters. Black Swan situations cannot be covered by RDADE or, for that matter, by any other probability-based risk analysis methodology. The reason is that any scenario that could be conceptualized is, by definition, no longer a Black Swan. Here we wish to depart from the somewhat optimistic notion of being able to cover all aspects of uncertainty by bringing Black Swans into the picture.

Black Swans are low-probability events by virtue of their rarity (possibly even without precedent), and, because they are so rare, they are not included in uncertainty models. Even if one were to attempt to recognize them, they would exist somewhere at the extremes: the hard-if-not-impossible-to-characterize, low-probability tails of the distribution. Because these tails are so hard to characterize, any risk assessment or management decision relying on such probabilities is called into question. Taleb (2007) discussed this situation, stating that “it is much easier to deal with Black Swan problems if we focus on robustness to errors rather than improving predictions.” Referring to Black Swan situations, Wardekker et al. (2010) demonstrated that “a resilience approach makes the system less prone to disturbances, enables quick and flexible responses, and is better capable of dealing with surprises than traditional predictive approaches.” Blöschl et al. (2013) extended the discussion in Taleb (2007) to flood risk situations by suggesting that the vulnerability of a system can be reduced by structural changes and emergency planning. Moreover, Merz et al. (2015) suggested that “decentralization, diversification, and redundancy can further enhance system robustness and adaptivity.” Underground, in the case of nuclear waste repositories or other potential high-risk polluters, it may be more beneficial to invest in early warning monitoring systems and other precautionary measures than to try to improve the predictive accuracy of what might happen tens of thousands of years into the future. To summarize, RDADE should be used primarily to analyze situations that can reasonably be modeled with probabilities; for the rest, planners should focus on reducing system vulnerability.

## 6 Summary and Conclusions

In this paper, we introduced RDADE, a framework for rational, risk-based decision making in water resources management, policy, regulation, and field campaign design. The framework enables the straightforward management of uncertainty and risk. Using RDADE, regulators and policymakers are able to define—in whatever way deemed appropriate—an acceptable level of risk in management decisions, while managers and practitioners are able to simply demonstrate that a decision is *defensible*, even if it is not ultimately *correct*.

The RDADE framework is general: It can accommodate any number of hydrogeological, biological, geochemical, or other conceptual models and can be used with any type of field data acquisition and inverse modeling method. RDADE itself does not design a field campaign; rather, that task is left to practitioners with experience in field methods and local hydrogeological conditions. What the framework does accomplish is to allow practitioners to take a proposed field campaign design and determine, probabilistically, whether or not the design will provide enough information for a defensible decision.

RDADE allows for the simple communication of uncertainty and risk. Instead of focusing on geostatistical model uncertainty, parametric uncertainty, or any other concept that may be unfamiliar to stakeholders who are not hydrologists, the framework allows for simple communication of uncertainty in the quantities directly relevant to decision making. While this simple communication of uncertainty and risk is no substitute for the transparent communication of the data collected, models used, or assumptions made in the analysis, the benefit is the ability to summarize such communication with a simple description of the levels of uncertainty and risk. In other words, it is easier to relate to the chances (or probabilities) of making an error in the decision than to explain what estimation variance means in relation to environmental or societal impact. What RDADE enables is communication of the type, “With budget *a*, we have a *b*% chance of error, but with a larger budget *c*, we can reduce the chance of error to *d*%.” This is not a trivial concept either, but it can be more easily understood and used as a means to shape discourse among stakeholders. Still, the ability to clearly state probabilities of success and failure does not resolve all ambiguities. For a complete approach, there is a need to define legally binding probability standards (Rubin et al., 2018). Once probabilistic laws are set, an approach such as RDADE can be used to demonstrate compliance with such requirements.

In this work, we demonstrated the RDADE framework using a case study predicting contaminant arrival time in an aquifer. The emphasis of this paper was the demonstration of the framework, so for conciseness and simplicity, several simplifying assumptions were made, including 2-D flow, a Gaussian field with low variance, an exponential variogram, and uniform prior distributions. These assumptions, of course, limit the applicability of the conclusions drawn regarding the spatial configuration of measurements to this scenario. However, as discussed, the assumptions made in the case study are not necessary for the general use of the framework. Going forward, research could utilize the RDADE framework to more closely examine the relationship between prior information, prediction uncertainty, and decision making in more realistic scenarios.

The case study showed that improved estimates of geostatistical parameters are not necessarily associated with improved water resources decision making, thus demonstrating the importance of designing field campaigns with the goal of making defensible management decisions, as opposed to optimal parameter estimates. It was also shown that the amount of field data necessary to make a decision must be determined on a case-by-case basis. The critical value of the EPM on which the decision depends, as well as the amount of prior information available about the site, can significantly affect the amount of field data that is necessary. This further highlights the importance of goal-oriented characterization design, which is important in light of the costs associated with site characterization approaches. The methods presented here can be utilized by managers to prevent overspending on unnecessary amounts of field data and can also ensure that measurements are strategically placed in order to ensure the maximum benefit of the data.

## Acknowledgements

All data used in this study were synthetically generated using the publicly available R package RandomFields, as cited. The baseline fields used in this study are available online (at https://doi.org/10.6078/D15Q4K). This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. The first author is grateful for the support provided by Helmholtz-UFZ during and after the time he spent there as a guest scientist. Additionally, the first author is grateful to Heather Savoy for her advice on various components of the computational and writing processes conducted for this project.