Robustness Metrics: How Are They Calculated, When Should They Be Used and Why Do They Give Different Results?
Abstract
Robustness is being used increasingly for decision analysis in relation to deep uncertainty and many metrics have been proposed for its quantification. Recent studies have shown that the application of different robustness metrics can result in different rankings of decision alternatives, but there has been little discussion of what potential causes for this might be. To shed some light on this issue, we present a unifying framework for the calculation of robustness metrics, which assists with understanding how robustness metrics work, when they should be used, and why they sometimes disagree. The framework categorizes the suitability of metrics to a decision‐maker based on (1) the decision‐context (i.e., the suitability of using absolute performance or regret), (2) the decision‐maker's preferred level of risk aversion, and (3) the decision‐maker's preference toward maximizing performance, minimizing variance, or some higher‐order moment. This article also introduces a conceptual framework describing when relative robustness values of decision alternatives obtained using different metrics are likely to agree and disagree. This is used as a measure of how “stable” the ranking of decision alternatives is when determined using different robustness metrics. The framework is tested on three case studies, including water supply augmentation in Adelaide, Australia, the operation of a multipurpose regulated lake in Italy, and flood protection for a hypothetical river based on a reach of the river Rhine in the Netherlands. The proposed conceptual framework is confirmed by the case study results, providing insight into the reasons for disagreements between rankings obtained using different robustness metrics.
1 Introduction
Uncertainty has long been considered an important facet of environmental decision‐making. This uncertainty arises from natural variability, as well as changes in system conditions over time (Maier et al., 2016). In the past, the latter have generally been represented by a “best guess” or “expected” future (Lempert et al., 2006). Consequently, much of the consideration of uncertainty was concerned with the impact of localized uncertainty surrounding expected future conditions (Giuliani et al., 2016c; Morgan et al., 1990) and a realization of the value of information for reducing this localized uncertainty (Howard, 1966; Howard & Matheson, 2005). The consideration of localized uncertainty is reflected in the wide‐spread usage of performance metrics such as reliability, vulnerability, and resilience (Burn et al., 1991; Hashimoto et al., 1982a; Maier et al., 2001; Zongxue et al., 1998). However, as a result of climatic, technological, economic and sociopolitical changes, there has been a realization that it is no longer possible to determine a single best‐guess of how future conditions might change, especially when considering longer planning horizons (e.g., on the order of 70–100 years) (Döll & Romero‐Lankao, 2017; Grafton et al., 2016b; Guo et al., 2017; Maier et al., 2016).
In response, there has been increased focus on deep uncertainty, which is defined as the situation in which parties to a decision do not know, or cannot agree on, how the system under consideration, or parts thereof, work, how important the various outcomes of interest are, and/or what the relevant exogenous inputs to the system are and how they might change in the future (Kwakkel et al., 2010; Lempert, 2003; Maier et al., 2016; Walker et al., 2013). In such a situation, one can enumerate multiple plausible possibilities without being able to rank them in terms of likelihood (Döll & Romero‐Lankao, 2017; Kwakkel et al., 2010). This inability can be due to a lack of knowledge or data about the mechanism or functional relationships being studied. However, it can also arise because the various parties involved in the decision cannot come to an agreement. That is, under deep uncertainty, there is a variety of uncertain factors that jointly affect the consequences of a decision. These uncertain factors define different possible states of the world in a deterministic and set‐based manner (Ben‐Tal et al., 2009).
As pointed out by Maier et al. (2016), when dealing with deep uncertainty, system performance is generally measured using metrics that preference systems that perform well under a range of plausible conditions, which fall under the umbrella of robustness. It should be noted that while robustness metrics have been considered in different problem domains, such as water resources planning (Hashimoto et al., 1982b), dynamic chemical reaction models (Samsatli et al., 1998), timetable scheduling (Canon & Jeannot, 2007), and data center network service levels (Bilal et al., 2013) for some time, this has generally been in the context of perturbations centered on expected conditions, or local uncertainty, rather than deep uncertainty. In contrast, consideration of robustness metrics for quantifying system performance under deep uncertainty, which is the focus of this article, has only occurred relatively recently. Commonly used robustness metrics include the following (illustrated with a brief computational sketch after the list):
- Expected value metrics (Wald, 1950), which indicate an expected level of performance across a range of scenarios.
- Metrics of higher‐order moments, such as variance and skew (e.g., Kwakkel et al., 2016b), which provide information on how the expected level of performance varies across multiple scenarios.
- Regret‐based metrics (Savage, 1951), where the regret of a decision alternative is defined as the difference between the performance of the selected option for a particular plausible condition and the performance of the best possible option for that condition.
- Satisficing metrics (Simon, 1956), which calculate the range of scenarios that have acceptable performance relative to a threshold.
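To make these four families concrete, the following minimal sketch computes one representative metric from each family for a matrix of performance values f(xi, sj). The performance matrix, threshold, and variable names are hypothetical and purely illustrative, assuming higher performance values are better.

```python
import numpy as np

# Hypothetical performance matrix: rows = decision alternatives x_i,
# columns = scenarios s_j; higher values indicate better performance.
f = np.array([[0.90, 0.75, 0.40],
              [0.80, 0.78, 0.55]])

# Expected value metric: mean performance across scenarios.
expected_value = f.mean(axis=1)

# Higher-order moment metric: variance of performance across scenarios.
variance = f.var(axis=1)

# Regret-based metric: regret = best achievable performance in a scenario
# minus the performance of the chosen alternative in that scenario.
regret = f.max(axis=0) - f          # regret of each alternative in each scenario
max_regret = regret.max(axis=1)     # worst-case regret per alternative

# Satisficing metric: fraction of scenarios meeting a performance threshold.
threshold = 0.6                      # hypothetical acceptability threshold
satisficing = (f >= threshold).mean(axis=1)

print(expected_value, variance, max_regret, satisficing)
```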
However, although these metrics all measure system performance over a set of future states of the world, they do so in different ways, making it difficult to assess how robust the performance of a system actually is. For example, these metrics reflect varying levels of risk aversion, as well as different views of what is meant by robustness. Is robustness about ensuring insensitivity to future developments, reducing regret, or avoiding very negative outcomes? This meta‐problem of deciding how to decide (Schneller & Sphicas, 1983) raises the following question: how robust is a robust solution?
This question has received increasing attention in the environmental literature in recent years. Lempert and Collins (2007) compared optimal expected utility, the precautionary principle, and robust decision making using a regret‐based measure of robustness. They found that the three approaches generated similar results, although some approaches may be more appropriate for different audiences and under different circumstances. Herman et al. (2015) compared two regret‐based metrics and two satisficing metrics, showing how the choice of metric could significantly affect the choice of decision alternative. However, they found that the two regret‐based metrics tended to agree with each other.
Drouet et al. (2015) contrasted maximin, subjective expected utility, and maxmin expected utility, while Roach et al. (2016) compared two satisficing metrics (info‐gap decision theory and Starr's domain criterion). Both studies found that the choice of metric can greatly influence the trade‐offs for decision‐makers. The former highlighted the importance of understanding the preferences of the decision‐maker, while the latter acknowledged the need for studies on more complex systems and the need to compare and combine metrics. Giuliani and Castelletti (2016) compared the classic decision theoretic metrics maximin, maximax, Hurwicz optimism‐pessimism rule, minimax regret, and Laplace's principle of insufficient reason, further showing that it is very important to select a metric that is appropriate for the decision‐maker's preferences to avoid underestimation of system performance. Kwakkel et al. (2016b) compared five robustness metrics and highlighted the importance of using a combination of metrics to see not just the expected value of performance, but also the dispersion of performance around the mean.
A common conclusion across this work is that different robustness metrics reflect different aspects of what makes a choice robust. This not only makes it difficult to assess the absolute “robustness” of an alternative, but also makes it difficult to determine whether a particular alternative is more robust than another. This leads to confusion for decision‐makers, as they have no means of comparing the robustness values and rankings of different decision alternatives obtained using different robustness metrics in an objective fashion.
To address this shortcoming, the objectives of this article are to (1) introduce a unified framework for the calculation of a wide range of robustness metrics, enabling the robustness values obtained from different metrics to be compared in an objective fashion, (2) introduce a taxonomy of robustness metrics and discuss how this can be used to assist with deciding which robustness metric is most appropriate, providing guidance for decision makers as to which robustness metric should be used in their particular context, (3) introduce a conceptual framework for conditions under which different robustness metrics result in different decisions, or how stable (“robust”) the ranking of an alternative is when different robustness metrics are used, providing further guidance to decision‐makers, and (4) test the conceptual framework from (3) on three case studies that provide a variety of decision contexts, objectives, scenario types and decision alternatives. The selected case studies are: the water supply augmentation in the southern Adelaide region in Australia (Paton et al., 2013), the operation of Lake Como in Italy for flood protection and water supply purposes (Giuliani & Castelletti, 2016), and flood protection for a hypothetical river called the Waas, which is based on a river reach of the Rhine delta in the Netherlands (Haasnoot et al., 2012).
The remainder of this article is organized as follows. In Section 2, the unified framework for the calculation of robustness metrics is introduced and a variety of robustness metrics are categorized according to this framework. A taxonomy based on these categories is provided in Section 3, as well as a summary of how the robustness metrics are classified in accordance with this taxonomy, the way they consider future uncertainties and the relative level of risk aversion they exhibit. In Section 4 an analysis of the conditions under which robustness metrics agree or disagree with other robustness metrics is given, as well as a conceptual framework categorizing the relative degree of agreement of the rankings of decision alternatives obtained using different robustness metrics based on the properties of the metric and the performance of the system under consideration. The three case studies are introduced in Section 5, as well as a summary of the similarities and differences between them. The robustness of different decision alternatives for the three case studies is calculated in Section 6 using a range of robustness metrics and the results are presented and discussed in terms of the stability of the ranking of different decision alternatives when different robustness metrics are used. Finally, conclusions are presented in Section 7.
2 How Are Robustness Metrics Calculated?
Even though there are many different robustness metrics, irrespective of which metric is used, their calculation generally requires the specification of (1) the decision alternatives (e.g., policy options, designs, solutions, management plans) for which robustness is to be calculated, (2) the outcome of interest (performance metric) of the decision alternatives (e.g., cost, reliability), and (3) the plausible future conditions (scenarios) over which the outcomes of interest/performance of the decision alternatives is to be evaluated. These three components of robustness are illustrated in Figure 1.

Robustness is generally calculated for a given decision alternative, xi, across a given set of future scenarios S = {s1, s2, …, sn} using a particular performance metric f(·). Consequently, the calculation of robustness using a particular metric corresponds to the transformation of the performance of a set of decision alternatives over different scenarios, f(xi, S) = {f(xi, s1), f(xi, s2), …, f(xi, sn)} to the robustness R(xi, S) of these decision alternatives over this set of scenarios. Although different robustness metrics achieve this transformation in different ways, a unifying framework for the calculation of different robustness metrics can be introduced by representing the overall transformation of f(xi, S) into R(xi, S) by three separate transformations: performance value transformation (T1), scenario subset selection (T2), and robustness metric calculation (T3), as shown in Figure 2. Details of these transformations for a range of commonly used robustness metrics are given in Table 1 and their mathematical implementations are given in Supporting Information S1.
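The decomposition into T1, T2, and T3 can be thought of as a composition of three functions applied to the vector of performance values for a given decision alternative. The sketch below illustrates this idea; the function names and example values are our own shorthand for exposition, not code from the cited studies (the exact mathematical definitions are given in Supporting Information S1).

```python
import numpy as np

def robustness(f_xi_S, t1, t2, t3):
    """Generic robustness calculation R(xi, S) = T3(T2(T1(f(xi, S))))."""
    return t3(t2(t1(f_xi_S)))

# Performance of one decision alternative over all scenarios (hypothetical values).
f_xi_S = np.array([0.90, 0.75, 0.40, 0.55, 0.80])

# Maximin: identity (T1), worst case only (T2), identity (T3).
maximin = robustness(f_xi_S,
                     t1=lambda f: f,
                     t2=lambda f: np.array([f.min()]),
                     t3=lambda f: f[0])

# Laplace's principle of insufficient reason: identity (T1), all scenarios (T2), mean (T3).
laplace = robustness(f_xi_S,
                     t1=lambda f: f,
                     t2=lambda f: f,
                     t3=np.mean)

print(maximin, laplace)
```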

| Metric | Original reference | T1: Performance value transformation | T2: Scenario subset selection | T3: Robustness metric calculation |
|---|---|---|---|---|
| Maximin | Wald (1950) | Identity | Worst‐case | Identity |
| Maximax | Wald (1950) | Identity | Best‐case | Identity |
| Hurwicz optimism‐pessimism rule | Hurwicz (1953) | Identity | Worst‐ and best‐cases | Weighted mean |
| Laplace's principle of insufficient reason | Laplace and Simon (1951) | Identity | All | Mean |
| Minimax regret | Savage (1951) and Giuliani and Castelletti (2016) | Regret from best decision alternative | Worst‐case | Identity |
| 90th percentile minimax regret | Savage (1951) | Regret from best decision alternative | 90th percentile | Identity |
| Mean‐variance | Hamarat et al. (2014) | Identity | All | Mean‐variance |
| Undesirable deviations | Kwakkel et al. (2016b) | Regret from median performance | Worst‐half | Sum |
| Percentile‐based skewness | Voudouris et al. (2014) and Kwakkel et al. (2016b)a | Identity | 10th, 50th, and 90th percentiles | Skew |
| Percentile‐based peakedness | Voudouris et al. (2014) and Kwakkel et al. (2016b)a | Identity | 10th, 25th, 75th, and 90th percentiles | Kurtosis |
| Starr's domain criterion | Starr (1963) and Schneller and Sphicas (1983) | Satisfaction of constraints | All | Mean |
- a Kwakkel et al. (2016b) adapted some metrics from Voudouris et al. (2014).
The performance value transformation (T1) converts the performance values f(xi, S) into the type of information f′(xi, S) used in the calculation of the robustness metric R(xi, S). For some robustness metrics, the absolute performance values (e.g., cost, reliability) are used, in which case T1 corresponds to the identity transform (i.e., the performance values are not changed). For other robustness metrics, the absolute system performance values are transformed to values that either measure the regret that results from selecting a particular decision alternative rather than the one that would have performed best had a particular future actually occurred, or indicate whether the selection of a decision alternative results in satisfactory system performance or not (i.e., whether required system constraints have been satisfied or not).
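Writing f(xi, sj) for the performance of alternative xi in scenario sj, the three forms of T1 can be expressed schematically as follows (an illustrative formulation assuming higher values of f are better and a performance threshold c; the exact expressions used by each metric are given in Supporting Information S1):

```latex
\begin{align}
  \text{identity:}    \quad & f'(x_i, s_j) = f(x_i, s_j) \\
  \text{regret:}      \quad & f'(x_i, s_j) = \max_{x \in X} f(x, s_j) - f(x_i, s_j) \\
  \text{satisficing:} \quad & f'(x_i, s_j) =
    \begin{cases} 1 & \text{if } f(x_i, s_j) \geq c \\ 0 & \text{otherwise} \end{cases}
\end{align}
```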
The scenario subset selection transformation (T2) involves determining which values of f′(xi, S) to use in the robustness metric calculation (T3) (i.e., f′(xi, S′) ⊆ f′(xi, S)), which is akin to selecting a subset of the available scenarios over which system performance is to be assessed. This reflects a particular degree of risk aversion, where consideration of more extreme scenarios in the calculation of a robustness metric corresponds to a higher degree of risk aversion and vice versa. As can be seen from Table 1, which scenarios are considered in the robustness calculation is highly variable between different metrics.
The third transformation (T3) involves the calculation of the actual robustness metric based on transformed system performance values (T1) for the selected scenarios (T2), which corresponds to the transformation of f′(xi, S′) to a single robustness value, R(xi, S). This equates to an identity transform in cases where only a single scenario is selected in T2, as there is only a single transformed performance value, which automatically becomes the robustness value. However, in cases where there are transformed performance values for multiple scenarios, these have to be transformed into a single value by means of calculating statistical moments of these values, such as the mean, standard deviation, skewness or kurtosis.
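As an illustration of T2 and T3 for metrics that use several percentiles, the sketch below computes percentile‐based skewness and peakedness from a vector of performance values. The specific formulas shown are common percentile‐based definitions of skew and kurtosis and are intended only as an illustration; the exact expressions used by Voudouris et al. (2014) and Kwakkel et al. (2016b) are given in the original references and in Supporting Information S1.

```python
import numpy as np

def percentile_skewness(f_xi_S):
    """Percentile-based skew: T2 selects the 10th, 50th, and 90th percentiles,
    T3 combines them into a skewness-like statistic (illustrative formula)."""
    p10, p50, p90 = np.percentile(f_xi_S, [10, 50, 90])
    return (p90 + p10 - 2.0 * p50) / (p90 - p10)

def percentile_peakedness(f_xi_S):
    """Percentile-based kurtosis: T2 selects the 10th, 25th, 75th, and 90th
    percentiles, T3 combines them into a peakedness-like statistic (illustrative)."""
    p10, p25, p75, p90 = np.percentile(f_xi_S, [10, 25, 75, 90])
    return (p90 - p10) / (p75 - p25)

f_xi_S = np.array([0.40, 0.55, 0.75, 0.80, 0.90])  # hypothetical performance values
print(percentile_skewness(f_xi_S), percentile_peakedness(f_xi_S))
```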
3 When Should Different Robustness Metrics Be Used?
In this section, a taxonomy of different robustness metrics is given in accordance with the three transformations introduced in Section 2. A summary of the three transformations, as well as the relative level of risk aversion, is provided in Section 3.4.
3.1 Transformation 1 (T1): Performance Value Transformation
A categorization of different robustness metrics in accordance with the performance value transformation (T1) is given in Table 2. As can be seen, the categorization is based on (1) whether calculation of a robustness metric is based on the absolute performance of a particular decision alternative or the performance of a decision alternative relative to that of another decision alternative or a benchmark; and (2) whether a robustness metric provides an indication of actual system performance or whether system performance is satisfactory compared with a pre‐specified performance threshold.
| | Robustness calculation based on relative performance values | Robustness calculation based on absolute performance values |
|---|---|---|
| Indication of whether system performance is satisfactory or not | (MORE)b, (POMORE)b, (Decision scaling)b | Starr's domain criterion, (Info‐gap decision theory)a |
| Indication of actual system performance | Minimax regret, 90th percentile minimax regret, Undesirable deviations | Maximin, Maximax, Hurwicz optimism‐pessimism rule, Laplace's principle of insufficient reason, Mean‐variance, Percentile‐based skewness, Percentile‐based peakedness |
- Note that parentheses around a metric indicate that the metric is considered unsuitable and is not considered in the following analysis.
- a Robustness calculated explicitly, but based on deviations from an expected scenario.
- b Robustness not calculated explicitly.
Many of the classic decision analytic robustness metrics belong to the bottom‐right quadrant of Table 2, including the maximax and maximin criteria, Hurwicz's optimism‐pessimism rule and Laplace's principle of insufficient reason, as well as more recently developed metrics such as the mean‐variance criterion, percentile‐based skewness and percentile‐based peakedness. These metrics utilize information about the absolute performance (e.g., cost, reliability) of a particular decision alternative in a particular scenario. Consequently, values of f′(xi, S) consist of these performance values, and robust decision alternatives are those that maximize system performance across the scenarios. The difference between these metrics lies in which values of the distribution of performance values over the different scenarios f(xi, S) they use in the robustness calculation (i.e., scenario subset selection (T2)) and how these values are combined into a single value of R (i.e., robustness metric calculation (T3)), as discussed in Sections 3.2 and 3.3.
Metrics in the bottom‐left quadrant of Table 2 are calculated in a similar manner to those in the bottom‐right quadrant, except that they use information about the performance of a decision alternative relative to that of other decision alternatives or a benchmark, and therefore generally express robustness in the form of regret or other measures of deviation. Consequently, the resulting values of f′(xi, S) consist of the differences between the actual performance of a decision alternative (e.g., cost, reliability) and that of another decision alternative or a benchmark. A robust decision alternative is the one that minimizes the maximum regret across scenarios (e.g., Herman et al., 2015). Alternative metrics that are based on the relative performance of decision alternatives use some type of baseline performance for a given scenario instead of the performance of the best decision alternative (Herman et al., 2015; Kasprzyk et al., 2013; Kwakkel et al., 2016b; Lempert & Collins, 2007; Popper et al., 2009).
Metrics in the top‐right quadrant of Table 2 measure robustness relative to a threshold or constraint in order to determine whether a decision alternative performs satisfactorily under different scenarios, and are commonly referred to as satisficing metrics. These metrics build on the work of Simon (1956), who pointed out that decision makers often look for a decision that meets one or more requirements (i.e., performance constraints) under a range of scenarios, rather than determining optimal system performance. Therefore, values of f′(xi, S) consist of information on the scenarios for which the decision alternatives under consideration meet a minimum performance threshold, and the larger the number of these scenarios, the more robust a decision alternative is. A well‐known example of this is the domain criterion, which focuses on the volume of the total space of plausible futures where a given performance threshold is met; the larger this space, the more robust the decision alternative. Often, this is simplified to looking at the fraction of scenarios where the performance threshold is met (e.g., Beh et al., 2015a; Culley et al., 2016; Herman et al., 2015), as scenarios provide a discrete representation of the space of plausible futures.
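In its discretized form, this amounts to counting the scenarios in which the performance threshold is met, as in the minimal sketch below (the threshold and performance values are hypothetical):

```python
import numpy as np

def domain_criterion(f_xi_S, threshold):
    """Fraction of scenarios in which performance meets or exceeds the threshold
    (discrete approximation of Starr's domain criterion)."""
    return np.mean(f_xi_S >= threshold)

f_xi_S = np.array([0.90, 0.75, 0.40, 0.55, 0.80])  # hypothetical performance values
print(domain_criterion(f_xi_S, threshold=0.6))      # -> 0.6 (3 of 5 scenarios)
```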
Satisficing metrics can also be based on the idea of a radius of stability, which has made a recent resurgence under the label of info‐gap decision theory (Ben‐Haim, 2004; Herman et al., 2015). Here, one identifies the uncertainty horizon over which a given decision alternative performs satisfactorily. The uncertainty horizon α is the distance from a pre‐specified reference scenario to the first scenario in which the pre‐specified performance threshold is no longer met (Hall et al., 2012; Korteling et al., 2012). However, as these metrics are based on deviations from an expected future scenario, they only assess robustness locally and are therefore not suited to dealing with deep uncertainty (Maier et al., 2016). These metrics also assume that the uncertainty increases at the same rate for all uncertain factors when calculating the uncertainty horizon on a set of axes. Consequently, they are shown in parentheses in Table 2 and will not be considered further in this article.
Metrics in the top‐left quadrant of Table 2 base robustness calculation on relative performance values and indicate whether these values result in satisfactory system performance or not. Methods belonging to this category are generally based on the concept of stability. However, in contrast to the stability‐based methods in the top‐right quadrant of Table 2, these methods assess stability of a decision alternative relative to that of another by identifying crossover points (Guillaume et al., 2016) at which the performance of one decision alternative becomes preferable to that of another and identifying the regions of the scenario space in which a given decision alternative is preferred over another. Methods belonging to this category include the management option rank equivalence (MORE) (Ravalico et al., 2010) and Pareto optimal management option rank equivalence (POMORE) (Ravalico et al., 2009) methods, as well as decision scaling (Brown et al., 2012; Poff et al., 2015). However, as these methods do not quantify robustness explicitly, they are shown in parentheses in Table 2 and will not be considered further in this article.
3.2 Transformation 2 (T2): Scenario Subset Selection
A categorization of different robustness metrics in accordance with the scenario subset selection transformation (T2) is given in Table 3. As can be seen, the categorization is based on whether all or a subset of the values of f′(xi, S) are used in the calculation of the robustness metric. If a subset of values is used, this can consist of a single value or a number of values. As shown in Table 3, Laplace's principle of insufficient reason, the mean‐variance metric and Starr's domain criterion use the full set of scenarios S and thus S′ = S. In contrast, the maximin, maximax, minimax regret and 90th percentile minimax regret metrics only use a single value from S to form S′. The metrics that use a number of selected scenarios S′ in the calculation of R include Hurwicz's optimism‐pessimism rule, undesirable deviations, percentile‐based skewness and percentile‐based peakedness.
| Robustness metric | Scenarios from S used to form the subset S′ |
|---|---|
| Maximin | Worst‐case |
| Maximax | Best‐case |
| Hurwicz optimism‐pessimism rule | Best‐ and worst‐case |
| Laplace's principle of insufficient reason | All |
| Minimax regret | Worst‐case |
| 90th percentile minimax regret | 90th percentile |
| Mean‐variance | All |
| Undesirable deviations | All performance values worse than the 50th percentile |
| Percentile‐based skewness | 10th, 50th and 90th percentiles |
| Percentile‐based peakedness | 10th, 25th, 75th and 90th percentiles |
| Starr's domain criterion | All |
Which scenarios from S are selected has a significant influence on the relative level of inherent risk aversion of a robustness metric, as shown in Figure 3. For example, the maximax metric has a very low inherent level of risk aversion, as its calculation is only based on the best performance over all scenarios considered (Table 3). In contrast, the maximin metric has a very high level of intrinsic risk aversion, as its calculation is only based on the worst performance over all scenarios considered (Table 3), leading to a very conservative solution (Bertsimas & Sim, 2004). Similarly, the minimax regret metric assumes that the selected decision alternative will have the largest regret possible, as its calculation is based on the worst‐case relative performance (Table 3). The other metrics fit somewhere in‐between these extremes of low and high levels of intrinsic risk aversion, as shown in Figure 3 and explained below.

Calculation of the metrics in the middle of Figure 3 is based on a subset S′ that covers all regions of S, thereby providing a balanced perspective corresponding to neither a low nor a high level of intrinsic risk aversion. Some of these metrics use all scenarios (S), such as Laplace's principle of insufficient reason and the mean‐variance metric, whereas others are based on a subset of percentiles S′ that sample the distribution of S in a balanced way, such as percentile‐based skewness, which uses the 10th, 50th and 90th percentiles, and percentile‐based peakedness, which uses the 10th, 25th, 75th, and 90th percentiles (Table 3). Intuitively, Hurwicz's optimism‐pessimism rule should also belong to this category, as it utilizes both the best and worst values of f(xi, S). However, as these values are weighted in the calculation of R using user‐defined weights (see Section 3.3), the resulting robustness values can correspond to anything from low to high levels of intrinsic risk aversion, depending on the selected weightings, as indicated in Figure 3. Similarly, robustness values obtained using Starr's domain criterion can range from low to high, depending on the value of the user‐selected minimum performance threshold. For example, if this threshold corresponds to a very high level of performance, the resultant robustness value will correspond to a very high level of intrinsic risk aversion, and vice versa.
The undesirable deviations and 90th percentile minimax regret metrics also use a subset S′; however, these scenarios do not cover all regions of S and are therefore less balanced. The undesirable deviations metric considers regret from the median for scenarios in which values of f(xi, S) are worse than the median, resulting in robustness values that have a higher level of intrinsic risk aversion than those obtained using metrics that use information from all regions of the distribution (Table 3). The 90th percentile minimax regret metric corresponds to an even greater level of intrinsic risk aversion, as it is based on a single value that is close to the worst case (90th percentile; see Table 3).
3.3 Transformation 3 (T3): Robustness Metric Calculation
A categorization of different robustness metrics in accordance with the final robustness metric calculation (T3) is given in Table 4. As can be seen, for some metrics, such as the maximin, maximax, minimax regret and 90th percentile minimax regret metrics, f′(xi, S′) and R(xi, S) are identical (i.e., the robustness metric calculation corresponds to the identity transformation). This is because for these metrics, S′ consists of a single scenario and there is no need to combine a number of values in order to arrive at a single value of robustness. However, for the remaining metrics, for which S′ contains at least two values, some sort of transformation is required. Metrics that are based on the mean or sum of f′(xi, S′), such as Laplace's principle of insufficient reason, mean‐variance and undesirable deviations, effectively assign an equal weighting to different scenarios and then suggest that the best decision is the one with the best mean performance, producing an expected value of performance. In contrast, in Hurwicz's optimism‐pessimism rule, the user can select the relative weighting of the two scenarios considered (the best and worst cases, corresponding to low and high levels of risk aversion, respectively), as mentioned in Section 3.2.
| Robustness metric | None | Sum | Mean | Weighted mean | Variance | Skew | Kurtosis |
|---|---|---|---|---|---|---|---|
| Maximin | √ | | | | | | |
| Maximax | √ | | | | | | |
| Hurwicz optimism‐pessimism rule | | | | √ | | | |
| Laplace's principle of insufficient reason | | | √ | | | | |
| Minimax regret | √ | | | | | | |
| 90th percentile minimax regret | √ | | | | | | |
| Mean‐variance | | | √ | | √ | | |
| Undesirable deviations | | √ | | | | | |
| Percentile‐based skewness | | | | | | √ | |
| Percentile‐based peakedness | | | | | | | √ |
| Starr's domain criterion | | | √ | | | | |
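To make the mean‐based and weighted‐mean calculations in Table 4 concrete, they can be written as follows (standard formulations, with α ∈ [0, 1] the optimism weight in Hurwicz's rule, selected by the decision‐maker, and notation as in Section 2):

```latex
\begin{align}
  \text{Laplace's principle:} \quad & R(x_i, S) = \frac{1}{n}\sum_{j=1}^{n} f'(x_i, s_j) \\
  \text{Hurwicz's rule:}      \quad & R(x_i, S) = \alpha \max_{s \in S} f(x_i, s)
                                      + (1-\alpha) \min_{s \in S} f(x_i, s)
\end{align}
```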
Alternatively, some metrics consider aspects of the variability of f′(xi, S′). For example, the mean‐variance metric attempts to balance the mean and variability of the performance of a decision alternative over different scenarios. However, a disadvantage of considering a combination of the mean and variance is that the resultant metric is not always monotonically increasing (Ray et al., 2013). Moreover, when considering variance, good and bad deviations from the mean are treated equally (Takriti & Ahmed, 2004). The undesirable deviations metric overcomes this limitation, while still providing a measure of variability. Other metrics are focused on different attributes of f′(xi, S′), such as the skewness and kurtosis.
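The difference between a symmetric and a one‐sided measure of variability can be illustrated as follows. This is an illustrative sketch assuming higher performance values are better; the exact definitions of the mean‐variance and undesirable deviations metrics are given in the cited references and in Supporting Information S1.

```python
import numpy as np

f_xi_S = np.array([0.40, 0.55, 0.75, 0.80, 0.90])  # hypothetical performance values

# Variance penalizes deviations above and below the mean equally.
symmetric_spread = f_xi_S.var()

# A one-sided measure only accumulates deviations on the undesirable side
# of the median (here: performance values below the median).
median = np.median(f_xi_S)
undesirable = np.sum(median - f_xi_S[f_xi_S < median])

print(symmetric_spread, undesirable)
```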
3.4 Summary of Categorization of Robustness Metrics
The complete categorization of the commonly used robustness metrics considered in this article in accordance with the three transformations (performance value transformation (T1) (Table 2), scenario subset selection (T2) (Table 3) and robustness metric calculation (T3) (Table 4)), as well as the relative level of risk aversion that is associated with T2 (Figure 3), is given in Table 5. It is hoped that this can provide some guidance to decision‐makers in relation to which robustness metric is appropriate for their decision context.
| Robustness metric | T1: Performance value transformation | T2: Scenario subset selection | Level of risk aversion: low (☆) to high (☆☆☆☆☆) | T3: Robustness metric calculation |
|---|---|---|---|---|
| Maximin | Absolute values (optimize system performance) | Single value | ☆☆☆☆☆ | None, sum, or mean |
| Maximax | Absolute values (optimize system performance) | Single value | ☆ | None, sum, or mean |
| Hurwicz optimism‐pessimism rule | Absolute values (optimize system performance) | Subset of values (2) | ☆ to ☆☆☆☆☆b | Weighted mean |
| Laplace's principle of insufficient reason | Absolute values (optimize system performance) | All values | ☆☆☆ | None, sum, or mean |
| Minimax regret | Relative values (optimize system performance) | Single value | ☆☆☆☆☆ | None, sum, or mean |
| 90th percentile minimax regret | Relative values (optimize system performance) | Single value | ☆☆☆☆ | None, sum, or mean |
| Mean‐variance | Absolute values (optimize system performance) | All values | ☆☆☆ | None, sum, or mean; variance |
| Undesirable deviations | Relative values (optimize system performance) | Va | ☆☆☆☆ | None, sum, or mean |
| Percentile‐based skewness | Absolute values (optimize system performance) | Subset of values (3) | ☆☆☆ | Skew |
| Percentile‐based peakedness | Absolute values (optimize system performance) | Subset of values (4) | ☆☆☆ | Kurtosis (peakedness) |
| Starr's domain criterion | Absolute values (satisfy constraints) | Va | ☆ to ☆☆☆☆☆c | None, sum, or mean |
- a V = variable.
- b Hurwicz optimism‐pessimism rule has a parameter (selected by the decision‐maker) to determine the relative level of risk aversion.
- c This is dependent on the minimum performance threshold selected by the decision‐maker.
In relation to the performance value transformation (T1), which robustness metric is most appropriate depends on whether the performance value in question relates to the satisfaction of a system constraint or not, and is therefore a function of the properties of the system under consideration. For example, if the system is concerned with supplying water to a city, there is generally a hard constraint in terms of supply having to meet or exceed demand, so that the city does not run out of water (Beh et al., 2017). The system performs satisfactorily if this demand is met and that is the primary concern of the decision‐maker. Alternatively, there might be a fixed budget for stream restoration activities, which also provides a constraint. In this case, a solution alternative performs satisfactorily if its cost does not exceed the budget. For the above examples, where performance values correspond to determining whether constraints have been met or not, satisficing metrics, such as Starr's domain criterion, are most appropriate.
In contrast, if the performance value in question relates to optimizing system performance, metrics that use the identity or regret transforms would be most suitable. For example, for the water supply security case mentioned above, the objective might be to identify the cheapest solution alternative that enables supply to satisfy demand. However, there might also be concern about over‐investment in expensive water supply infrastructure that is not needed, in which case robustness metrics that apply a regret transformation might be most appropriate, as this would enable the degree of over‐investment to be minimized when applied to the cost performance value. For the stream restoration example, however, decision‐makers might simply be interested in maximizing ecological response for the given budget. In this case, robustness metrics that use the identity transform might be most appropriate when considering performance values related to ecological response.
In relation to scenario subset selection (T2), which robustness metric is most appropriate depends on a combination of the likely impact of system failure and the degree of risk aversion of the decision‐maker. In general, if the consequences of system failure are more severe, the degree of risk‐aversion adopted would be higher, resulting in the selection of robustness metrics that consider scenarios that are likely to have a more deleterious impact on system performance. For example, in the water supply security case, it is likely that robustness metrics that consider more extreme scenarios would be considered, as a city running out of water would most likely have severe consequences. In contrast, as the potential negative impacts for the stream restoration example are arguably less severe, robustness metrics that use a wider range or less severe scenarios might be considered. However, this also depends on the values and degree of risk aversion of the decision maker.
As far as the robustness value calculation (T3) goes, this is only applicable to metrics that consider more than one scenario, as discussed previously, and relates to the way performance values over the different scenarios are summarized. For example, if there is interest in the average performance of the system under consideration over the different scenarios selected in T2, such as the average cost for the water supply security example or the average ecological response for the stream restoration example, a robustness metric that sums or calculates the mean of these values should be considered. However, decision‐makers might also be interested in (1) the variability of system performance (e.g., cost, ecological response) over the selected scenarios, in which case robustness metrics based on variance should be used, (2) the degree to which the relative performance of different decision alternatives is different under more extreme scenarios, in which case robustness metrics based on skewness should be used, and/or (3) the degree of consistency in the performance of different decision alternatives over the scenarios considered, in which case robustness metrics based on kurtosis should be used.
4 When Do Robustness Metrics Disagree?
Consider two robustness metrics, Ra and Rb, used to rank two decision alternatives, x1 and x2, over a set of scenarios S. Relative differences in the robustness values obtained when different robustness metrics are used are a function of (1) the differences in the transformations (i.e., performance value transformation (T1), scenario subset selection (T2), robustness metric calculation (T3)) used in the calculation of Ra and Rb and (2) differences in the relative performance of decision alternatives x1 and x2 over the different scenarios considered. In general, ranking stability is greater if there is greater similarity in the three transformations for Ra and Rb and if there is greater consistency in the relative performance of x1 and x2 for the scenarios considered in the calculation of Ra and Rb, as shown in the conceptual representation in Figure 4. In fact, if the relative performance of two decision alternatives is the same under all scenarios, the relative ranking of these decision alternatives is stable, irrespective of which robustness metric is used.
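In other words, assuming both metrics are oriented so that larger values indicate greater robustness, the ranking of x1 and x2 is stable across Ra and Rb when both metrics order the two alternatives in the same way, which can be written schematically as:

```latex
\operatorname{sign}\big( R_a(x_1, S) - R_a(x_2, S) \big)
  \;=\;
\operatorname{sign}\big( R_b(x_1, S) - R_b(x_2, S) \big)
```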

4.1 Similar Transformations and Consistent Relative Performance
If the transformations used in the calculation of the robustness metrics are similar and the performance of the two decision alternatives considered is consistent across the scenarios, one would expect ranking stability to be very high (top‐right quadrant, Figure 4). For example, when minimax regret and 90th percentile minimax regret correspond to Ra and Rb, there is a high degree of similarity in the performance value transformation (T1), scenario subset selection (T2), and robustness metric calculation (T3) (y‐axis). For both metrics, the performance values are transformed to regret, S′ corresponds to a single scenario and there is no need to combine any values as part of the robustness metric calculation (T3), as there is only a single value of regret (Table 5). Similarly, there is a high degree of consistency in the relative performance values used for the calculation of Ra and Rb (x‐axis), as minimax regret uses the worst‐case scenario and 90th percentile minimax regret uses a scenario that almost corresponds to the worst case (Table 3). Consequently, one would expect the ranking of decision alternatives to be very stable when these two metrics are used.
4.2 Different Transformations and Inconsistent Relative Performance
Ranking stability is generally low if there are marked differences in the three transformations for Ra and Rb and if there is greater inconsistency in the relative performance of x1 and x2 for the scenarios considered in the calculation of Ra and Rb. Consequently, if both of these conditions are met, one would expect ranking stability to be low (bottom‐left quadrant, Figure 4). For example, when Ra and Rb correspond to minimax regret and percentile‐based peakedness, there is a high degree of difference in performance value transformation (T1), scenario subset selection (T2) and robustness metric calculation (T3) (y‐axis). For the former, performance values are transformed to regret, S′ consists of one scenario (worst‐case scenario) and there is no need to combine any values as part of the robustness metric calculation (T3). For the latter, the actual performance values are used, S′ consists of four scenarios (10th, 25th, 75th, and 90th percentiles) and the robustness metric calculation is the kurtosis of the four performance values (see Tables 3 and 5). Similarly, there is a potentially high degree of inconsistency in the relative performance values used for the calculation of Ra and Rb (x‐axis), as minimax regret uses the worst‐case scenario, whereas percentile‐based peakedness uses four scenarios spread evenly across the distribution of S (Table 3). Consequently, one would expect the ranking of decision alternatives to be generally unstable when these two metrics are used.
4.3 Different Transformations and Consistent Relative Performance
In cases where there are marked differences in the three transformations for Ra and Rb but consistency in the relative performance of x1 and x2 over the scenarios considered in the calculation of Ra and Rb (bottom‐right quadrant, Figure 4), ranking stability can range from high to low. This is because the interactions between the various drivers of ranking stability are complex and difficult to predict a priori. For example, when Laplace's principle of insufficient reason and percentile‐based skewness correspond to Ra and Rb, there is a moderate degree of difference in the three transformations (y‐axis). Both use actual performance values, but the former uses values from all scenarios and averages them, whereas the latter uses the 10th, 50th, and 90th percentiles and calculates their skewness (see Tables 3 and 5). However, as both use values from similar regions of the performance distribution, it is likely that there is a high degree of consistency in the relative performance values used in the robustness calculation (x‐axis). Consequently, this case belongs to the bottom‐right quadrant in Figure 4, where ranking stability can vary from low to high, depending on the relative impact of using the average and the skewness of performance values in the robustness metric calculation (T3).
4.4 Similar Transformations and Inconsistent Relative Performance
In cases where the three transformations for Ra and Rb are similar but the relative performance of x1 and x2 is inconsistent over the scenarios considered in the calculation of Ra and Rb (top‐left quadrant, Figure 4), ranking stability can also range from high to low due to the complex interactions between the different drivers affecting ranking stability. For example, when maximax and maximin correspond to Ra and Rb, there is a high degree of similarity in the three transformations (y‐axis). For both metrics, the actual performance values are used (T1 is the identity transform), S′ corresponds to a single scenario and there is no need to combine any values as part of the robustness metric calculation (T3), as there is only a single value of performance (Table 5). However, there is a potentially low degree of consistency in the relative performance values used in the robustness calculations (x‐axis), as the single performance values used in the calculation of these two robustness metrics come from opposite ends of the distribution of performance values (i.e., one corresponds to the best case and one to the worst case). Consequently, this case belongs to the top‐left quadrant in Figure 4, where ranking stability can vary from low to high, depending on the consistency in the relative performance of x1 and x2 for the best‐ and worst‐case scenarios.
5 Case Studies
Three case studies with different properties are used to test the conceptual model of ranking stability introduced in Section 4, as shown in Table 6. As can be seen, the case studies represent water supply systems and flood prevention systems, with decision variables including changes to existing infrastructure, construction of new infrastructure, and changes to operational rules or policies. The number of scenarios varies greatly in each case study (28–3000), as does the number of optimal decision alternatives considered (11–72).
| Name | Location | Decision variables, components of xi | Selected objectives and performance metrics, f(xi, S) | Number of scenarios, n, where S = {s1, …, sn} | Number of decision alternatives, m where X = {x1, …, xm} |
|---|---|---|---|---|---|
| Southern Adelaide water supply system | Adelaide, Australia | Construction of new water supply infrastructure (e.g., desalination plants, rainwater tanks, stormwater harvesting) and time of implementation | Reliability (water supply) | 125 | 72 |
| Lake Como | Como, Italy | Parameterization of policies to determine releases based on day of year, current lake storage, and previous day inflow | Reliability (flood prevention); Reliability (water supply) | 28 | 19 |
| Waas | Rhine delta, The Netherlands (hypothetical model based on the real River Waal) | Changes to existing infrastructure for flood reduction and flood damage reduction, and changes to operations (e.g., limits to upstream maximum discharge) | Flood damage; Casualties | 3000 | 11 |
5.1 Southern Adelaide
This urban water supply augmentation case study models the southern region of the Adelaide water supply system, as it existed in 2010 (Beh et al., 2014, 2015a, 2015b, 2017; Paton et al., 2013, 2014a, 2014b). Adelaide has a population of approximately 1.3 million people and is the capital city of the state of South Australia. Characterized by a Mediterranean climate and an annual rainfall of between 257 and 882 mm (average of 552 mm) over the period from 1889 to 2010 (Paton et al., 2013), Adelaide is one of the driest capital cities in the world (Wittholz et al., 2008). The southern Adelaide system supplies approximately 50% of the water mains consumption (168 GL in 2008) (Beh et al., 2014).
In 2010, the southern Adelaide system consisted of three reservoirs to supply water, as illustrated in Figure 5: Myponga Reservoir collects water from local catchments; Mt. Bold Reservoir collects water both from local catchments and water pumped from the River Murray via the Murray Bridge—Onkaparinga pipeline; Happy Valley reservoir is a service reservoir storing water that has been transferred from the Mt. Bold Reservoir. Water from the River Murray is limited to a maximum of 650 GL over a 5‐year rolling period and it is assumed that half of this water is available to the southern Adelaide system.

Due to projected increases in demand and a changing climate, there is a need to augment the water supply system (Paton et al., 2013). In particular, the River Murray will be greatly affected by climate change (Grafton et al., 2016a). This article considers 125 scenarios corresponding to various combinations of representative concentration pathways (RCPs) and general circulation models (GCMs) used to project changes in future rainfall for the Adelaide system. The performance of each decision alternative is measured using the reliability of water supply (see Table 6).
5.2 Lake Como
Lake Como is the third largest Italian lake, with a total volume of 23.4 km³. The lake is fed by a 4552 km² watershed (see Figure 6) characterized by a mixed snow‐rain dominated hydrological regime, with relatively dry winters and summers and higher flows in spring and autumn due to snowmelt and precipitation, respectively. Lake releases have been regulated since 1946 with the twofold purpose of flood protection along the lake shores, particularly in the city of Como, and water supply to the downstream users, including eight run‐of‐the‐river hydropower plants and a dense network of irrigation canals, which distribute the water to four agricultural districts with a total surface of 1400 km², mostly cultivated with maize (Giuliani et al., 2016a; Guariso et al., 1985, 1986).

To satisfy the summer peak in water demand, the current regulation operates the lake so as to store a large fraction of the snowmelt, such that the lake is approximately at full capacity between June and July (Denaro et al., 2017). The projected shift of snowmelt to earlier in the year caused by increasing temperatures, coupled with the predicted decrease in water availability in the summer period, would require storing additional water for longer periods, ultimately increasing the flood risk. Optimal flood protection would instead be obtained by drawing down the lake level as much as possible (Giuliani & Castelletti, 2016).
Due to a changing climate, and thus a changing flood risk (Giuliani & Castelletti, 2016; McDowell et al., 2014) and availability of water (Iglesias & Garrote, 2015), a climate ensemble of 28 scenarios was used for analysis by Giuliani and Castelletti (2016) and in the following analysis. These scenarios are combinations of different RCPs, global climate models, and regional climate models. The resulting trajectories of temperature and precipitation are then statistically downscaled by means of quantile mapping and used as inputs to a hydrological model to generate projections of the Lake Como inflows over the period 2096–2100.
Two objectives are considered:
- Flooding: the storage reliability (to be maximized).
- Irrigation: the daily average volumetric reliability (to be maximized).

A previous study (Giuliani & Castelletti, 2016) generated 19 Pareto optimal decision alternatives by optimizing the flooding and irrigation objectives over historical climate conditions via evolutionary multiobjective direct policy search, a simulation‐based optimization approach that combines direct policy search, nonlinear approximating networks, and multiobjective evolutionary algorithms (Giuliani et al., 2016b). These optimal reservoir operation policies are used in the following analysis.
5.3 Waas
The Waas case study is a hypothetical case based on a river reach in the Rhine delta of the Netherlands (the river Waal). An Integrated Assessment Meta Model (Haasnoot et al., 2012) is used, which is theory‐motivated (Haasnoot et al., 2014) and has been derived from more detailed, validated models of the Waal area. The river and floodplain are highly schematized but have realistic characteristics (see Figure 7), with the river being bounded by embankments and the floodplain composed of five dike rings. In the southeast, a large city is situated on higher ground, while smaller villages exist in the remaining area. Other forms of land use include greenhouses, industry, conservation areas, and pastures. In the recent past, two large flood events occurred in the Waal area, on which this hypothetical case study is based, resulting in considerable damage to houses and agriculture (Haasnoot et al., 2009). In the future, changes in land use and climate, as well as socioeconomic developments, may further increase the risk of damage, so action is needed.

A wide range of uncertainties is considered, including climate change and its impact on river discharge (see Haasnoot et al. (2012) for details) and land use change through seven transient land use scenarios. Uncertainty with respect to the fragility of dikes and economic damage functions is taken into account by applying a bandwidth of plus and minus 10% around the default values. Finally, some aspects of policy uncertainty are included, both through the uncertainty of the fragility function and by letting the impact of the actions vary (Kwakkel et al., 2015). These drivers of change are combined to form a total of 3000 scenarios.
Damage due to the flooding of dike rings is calculated from water depth and damage relations (De Bruijn, 2008; Haasnoot et al., 2009). Using these relations, the model calculates the flood impacts per hectare for each land use to obtain the total damage for sectors such as agriculture, industry, and housing. Casualties are assessed using water depth, land use, and flood alarms triggered by the probability of dike failure. These performance measures form the three objectives considered in the original studies (Kwakkel et al., 2015, 2016a): costs, loss of life, and economic damages. However, because the costs were rarely affected by the scenarios, this objective was not included in this study. In previous studies, a many‐objective robust optimization approach was used to design robust adaptation pathways (Kwakkel et al., 2015, 2016a) and 11 distinct adaptation pathways were identified. These optimal adaptation pathways are used in the following analysis.
6 Results and Discussion
To assess whether the rankings of decision alternatives are likely to be similar between two metrics for the different case studies and objectives considered, the percentage of pairs of decision alternatives for which the ranking is stable is used. A stable pair of decision alternatives is one in which one decision alternative is ranked higher than the other by both of the robustness metrics being compared, as described in Section 4. The ranking stability for each pair of metrics is displayed in Figure 8. A ranking stability of 100% indicates that the two metrics agreed on the rankings for every pair of decision alternatives, while 0% indicates that one metric ranked every pair of decision alternatives in the reverse order to the other metric. The robustness values for each case study are included in Supporting Information S1. Figure 8 also provides basic information about the three transformations used in the calculation of each robustness metric in an effort to assess how well the results agree with the conceptual model presented in Figure 4.
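A minimal sketch of this calculation is given below. It is illustrative only; Ra and Rb are hypothetical robustness values for each decision alternative under the two metrics being compared, both oriented so that larger values indicate greater robustness, and ties are counted as agreement for simplicity.

```python
import numpy as np
from itertools import combinations

def ranking_stability(Ra, Rb):
    """Percentage of pairs of decision alternatives ranked in the same order
    by two robustness metrics."""
    pairs = list(combinations(range(len(Ra)), 2))
    stable = sum(1 for i, j in pairs
                 if np.sign(Ra[i] - Ra[j]) == np.sign(Rb[i] - Rb[j]))
    return 100.0 * stable / len(pairs)

# Hypothetical robustness values for five decision alternatives under two metrics.
Ra = np.array([0.9, 0.7, 0.5, 0.6, 0.8])
Rb = np.array([0.8, 0.6, 0.7, 0.5, 0.9])
print(ranking_stability(Ra, Rb))  # -> 70.0
```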

6.1 Impact of Transformations
Figure 8 indicates that the pairs of metrics with high stability (lower portion of the figure, shaded mostly green) tend to share the same robustness metric calculation transformation (T3). For example, in cases where both metrics use the identity transformation, sums, or averages of f′(xi, S) (all indicated by "M" in the T3 columns), rankings are generally stable. In contrast, the metrics with low stability (upper portion of Figure 8, shaded mostly red and yellow) tend to have different robustness metric calculation transformations. An example is the percentile‐based peakedness metric, which is the only metric to use kurtosis. Every other metric uses a different robustness metric calculation transformation, and hence, when percentile‐based peakedness is one of the two robustness metrics considered, rankings are generally unstable. This can be explained by the fact that, when different types of calculations are used to transform f′(xi, S) into R(xi, S), the resulting robustness values reflect different attributes of the distribution of f′(xi, S), as discussed in Section 4. For example, as can be seen in Figure 4, two metrics that use different robustness metric calculation transformations (T3) will result in low stability unless there are consistent differences between two decision alternatives over the different scenarios.
In general, a pair of metrics with the same robustness metric calculation transformation (T3) almost always has high ranking stability, while a pair with a different T3 almost always has low ranking stability. However, Figure 8 indicates that the same is not always true of the other two transformations (i.e., performance value transformation (T1) and scenario subset selection (T2)), although in some cases they can have an impact. For example, the maximax and maximin metrics share the same robustness metric calculation transformation (T3). However, their ranking stability is markedly lower than that for other metrics that share the same T3, particularly for the Adelaide and Lake Como case studies. In this case, the primary cause of the reduced ranking stability is associated with scenario subset selection (T2). The selected scenarios S′ for the maximin and maximax criteria correspond to opposite extremes of the distribution of S, and hence these two metrics show high levels of disagreement. This places the comparison of these two metrics in the middle or lower region of Figure 4 and explains the large variance in the ranking stability of the maximin and maximax metrics in Figure 8. This variance in ranking stability is particularly clear when there is not a large, consistent difference in performance between decision alternatives. The maximax metric also differs from most other metrics in this respect, although to a lesser extent than it differs from the maximin metric, and it can be seen in Figure 8 that this results in variable levels of agreement between the maximax metric and the other metrics in each case study.
Similarly, the undesirable deviations metric uses the sum of f′(xi, S) and is hence grouped with many other metrics when considering the robustness metric calculation transformation (T3). Like the maximin and maximax comparison, the undesirable deviations metric shows varying ranking stability depending on the case study. This is explained by the effects of the performance value transformation (T1). The undesirable deviations metric uses the regret of a decision alternative in each scenario, whereas most metrics use the actual performance values. Its calculation of regret also differs from that of the other regret metrics (minimax regret and 90th percentile minimax regret) because it considers regret relative to the median performance of that decision alternative, rather than regret relative to the absolute best performance across all decision alternatives.
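The contrast between these two regret-style performance value transformations (T1) can be sketched as follows; the performance values are illustrative only, and the exact formulations (sum of deviations below an alternative's own median versus worst-case shortfall relative to the best alternative in each scenario) are stated as assumptions for the purpose of the example.

```python
import numpy as np

# Hypothetical performance (to be maximized) of two alternatives over five scenarios.
performance = np.array([
    [0.9, 0.7, 0.8, 0.6, 0.5],   # alternative 1
    [0.8, 0.8, 0.7, 0.7, 0.6],   # alternative 2
])

# Undesirable-deviations-style regret: deviation below each alternative's own median.
median = np.median(performance, axis=1, keepdims=True)
undesirable_dev = np.clip(median - performance, 0.0, None).sum(axis=1)

# Minimax-regret-style regret: shortfall relative to the best alternative in each
# scenario, summarized by the worst (maximum) regret.
best_per_scenario = performance.max(axis=0)
minimax_regret = (best_per_scenario - performance).max(axis=1)

print("undesirable deviations:", undesirable_dev)  # smaller is more robust
print("worst-case regret:     ", minimax_regret)   # smaller is more robust
# Here the two alternatives tie on worst-case regret but differ on undesirable
# deviations, illustrating how the two regret definitions can rank alternatives differently.
```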
A relatively low level of agreement is seen when comparing the maximax and undesirable deviations metrics (Figure 8). Similar to the above discussion, this is due to the different sampling methods used for scenario subset selection (T2) and the different performance value transformations (T1). Maximax samples a single value from the left‐hand side of the distribution, whereas the undesirable deviations metric samples the 50% of values from the right‐hand side of the distribution. In addition, there is also a difference in the initial performance value transformation (T1), with the maximax metric using the raw performance values and the undesirable deviations metric using the regret of a decision alternative relative to its median performance.
6.2 Impact of Relative Performance
As can be seen in Figure 8, although there is generally a high degree of consistency in ranking stability based on the similarity between the three transformations, this does not hold for certain combinations of robustness metrics and case studies/objectives. This is because ranking stability is affected not only by the similarities in/differences between robustness metrics, but also by the similarities/differences in the relative performance of two decision alternatives under the different scenarios considered (see Figure 4). For example, as can be seen in Figure 8, ranking stability for the Adelaide case study is low when the maximin metric is paired with other metrics that use the same type of robustness metric calculation transformation (T3), while this is not the case for the other case studies. This is because many of the decision alternatives have a reliability of 0% in the worst‐case scenario and, due to the scenario subset selection (T2), the maximin metric only considers this worst‐case scenario and thus ranks many of the decision alternatives as equal. Other metrics with different scenario subset selection methods use different scenarios (which vary depending on the decision alternative) or use more scenarios, and thus rank the decision alternatives differently.
It is also worth noting the high level of disagreement obtained in the Lake Como case for the undesirable deviations metric when considering the reliability of water supply for irrigation. This effect does not appear when considering the reliability against flooding. The asymmetry can be explained by the fact that the IPCC projections for the alpine region consistently suggest a decrease in water availability in the summer period due to a change in snow accumulation and melting dynamics. The impacts of global warming are expected to reduce the precipitation that falls as snow in winter and, at the same time, to reduce snow melt. The combined effect of reduced snow accumulation and reduced snow melt strongly challenges the ability to fill the lake to provide irrigation water during the summer period. However, the temporal distribution of these effects can differ owing to the variability in the scenarios considered, ultimately producing variable impacts on the performance of different operating policies, which implement different hedging strategies over time. The variable and asymmetric distribution of the resulting performance (toward degradation) is captured by the metrics relying on a subset of values in the scenario subset selection transformation (T2) (i.e., undesirable deviations and the metrics relying on multiple percentiles), while other metrics do not recognize this effect, producing inconsistent rankings.
7 Summary and Conclusions
Metrics that consider local uncertainty (i.e., reliability, vulnerability, and resilience) have long been used in environmental decision‐making. Due to deeply uncertain drivers of change, including climatic, technological, and sociopolitical changes, decision‐makers have begun to consider multiple scenarios (plausible futures) and robustness metrics to quantify the performance of decision alternatives under deep uncertainty. A large variety of robustness metrics has been considered in recent research, with little discussion of the implications of using each metric and little understanding of the ways in which the metrics are similar or different. However, it has become clear that the choice of robustness metric can have a large effect, with metrics sometimes disagreeing on which decision alternative is more robust.
This article presents a unifying framework for the calculation of robustness metrics based on three major transformations (the performance value transformation (T1), scenario subset selection (T2), and the robustness metric calculation (T3)) used to convert system performance values (e.g., reliability) into a final value of robustness that can be used to rank decision alternatives. The performance value transformation (T1) converts the original performance values into the information that the decision‐maker is interested in. The second transformation (T2) corresponds to the selection of the scenarios (and associated system performance values) that the metric will use. The final transformation (T3) converts the transformed performance values over the selected scenarios into a single value of robustness.
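A compact sketch of this three-step view is given below; the function names and the two example metrics are illustrative assumptions rather than the authors' implementation, and any robustness metric is expressed as the composition of T1, T2, and T3.

```python
import numpy as np


def robustness(performance, t1, t2, t3):
    """Compose T1, T2, and T3 to turn per-scenario performance into a robustness value."""
    transformed = t1(performance)  # T1: e.g., identity, regret, or constraint satisfaction
    selected = t2(transformed)     # T2: e.g., all scenarios, worst case, or selected percentiles
    return t3(selected)            # T3: e.g., mean, variance, or a higher-order moment


# Hypothetical performance of one decision alternative over five scenarios.
scenario_performance = np.array([0.9, 0.7, 0.8, 0.6, 0.5])

# Mean performance: identity T1, keep-all T2, mean T3.
mean_metric = robustness(scenario_performance, lambda f: f, lambda f: f, np.mean)

# Maximin: identity T1, worst-case T2, identity T3.
maximin_metric = robustness(scenario_performance, lambda f: f, np.min, lambda f: f)

print(mean_metric, maximin_metric)  # 0.7 0.5
```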
This article also presents a conceptual framework for assessing the stability of the ranking of different decision alternatives when different robustness metrics are used. The framework indicates that the greater the similarity in the three transformations underlying two robustness metrics, the more stable the ranking of decision alternatives obtained using these metrics, and vice versa. Ranking stability is also affected by the degree of consistency in the relative performance of different decision alternatives across the scenarios: ranking stability increases if one decision alternative always outperforms the other, and vice versa. To test this conceptual understanding of ranking stability, the stability of each pair of metrics was determined for five objectives in three case studies, which confirmed the proposed conceptual model. The robustness metric calculation (T3) was found to be the most influential of the three transformations in determining ranking stability; however, the other two transformations also contributed.
In conclusion, robustness metrics can be split into three transformations, which provides a unifying framework for the calculation of robustness. This framework helps decision‐makers understand when different robustness metrics should be used by considering (1) the information the decision context relates to most (e.g., absolute performance, regret, or the satisfaction of constraints) (performance value transformation (T1)), (2) the preference of a decision‐maker toward a high or low level of risk aversion for the case study under consideration through scenario subset selection (T2), and (3) the decision‐maker's preference toward maximizing average performance, minimizing variance, or some other higher‐order moment, as described by the robustness metric calculation (T3). These three transformations and the properties of the case studies are useful in describing why rankings of decision alternatives obtained using different robustness metrics sometimes disagree.
Acknowledgments
Thanks are given to Leon van der Linden for his guidance on behalf of SA Water Corporation (Australia), which supports the research of Cameron McPhail through Water Research Australia, and to the anonymous reviewers, whose comments have significantly improved the quality of this article. The case study robustness data are included in Supporting Information S1.