COMBINING INFORMATION FROM MULTIPLE FLOOD PROJECTIONS IN A HIERARCHICAL BAYESIAN FRAMEWORK

This study demonstrates, in the context of flood frequency analysis, the potential of a recently proposed hierarchical Bayesian approach to combine information from multiple models. The approach explicitly accommodates shared multi-model discrepancy as well as the probabilistic nature of the flood estimates, and treats the available models as a sample from a hypothetical complete (but unobserved) set of models. The methodology is applied to flood estimates from multiple hydrological projections (the Future Flows Hydrology dataset) for 135 catchments in the UK. The advantages of the approach are shown to be: 1) to ensure an adequate 'baseline' with which to compare future changes; 2) to reduce flood estimate uncertainty; 3) to maximise use of statistical information in circumstances where multiple weak predictions individually lack power, but collectively provide meaningful information; 4) to diminish the importance of model consistency when model biases are large; and 5) to explicitly consider the influence of the (model performance) stationarity assumption. Moreover, the analysis indicates that reducing shared model discrepancy is the key to further reduction of uncertainty in the flood frequency analysis. The findings are of value regarding how conclusions about changing exposure to flooding are drawn, and to flood frequency change attribution studies.


Introduction
Multi-model ensembles (MMEs), sets of predictions made by different models, are becoming widely accepted for flood frequency change analysis [e.g. Arnell and Gosling, 2014; Hirabayashi et al., 2013; Kay and Jones, 2012]. The use of multiple models results in large uncertainty around estimates of flood magnitudes, due to both uncertainty in model selection and natural variability of (simulated) river flow [Madsen et al., 1997]. The challenge is therefore to extract the most meaningful signal from the multi-model predictions, accounting for both model quality and uncertainties in individual model estimates. Multi-model ensembles became increasingly widespread at the beginning of the 21st century, with a notable development being the IPCC reporting ensemble averages in its Third Assessment Report [Penner et al., 2001]. Concurrently, a probabilistic view on ensemble predictions was introduced by Räisänen and Palmer [2001], who assigned equal weights to each ensemble member. This was refined in the reliability ensemble average (REA) approach developed by Giorgi and Mearns [2002]; the authors proposed assigning weights to the ensemble member models based on each model's track record of bias and its distance from the ensemble's consensus; this was extended via a Bayesian analysis for the unknown model weights by Tebaldi et al. [2005]. Similarly, the Bayesian Model Averaging method [Hoeting et al., 1999; Raftery et al., 2005] provides an estimate for a quantity of interest as a weighted average of predictive density functions provided by multiple models that are centered on the bias-corrected forecasts, with the weights reflecting the models' relative predictive skill.
Other examples of MME analysis include hierarchical linear model fitting to aggregated observations using similarly aggregated MME projections [Greene et al., 2006], and the use of spatial statistics to model geographical patterns [Furrer et al., 2007]. The typical challenge with model averaging is that the chosen weighting scheme depends on an arbitrary selection of the performance metric, which itself becomes an additional source of uncertainty. Further, a set of weights, derived via unavoidably arbitrary criteria and for a limited sub…

…of peak flows that exceed a pre-defined threshold, and the characteristics of exceedance flow magnitudes. PDS assumes that the occurrence of peak flows follows a Poisson process with parameter λ, while the exceedance flow magnitude is specified to follow a generalized Pareto (GP) distribution with parameters α and κ [Madsen et al., 1997]. One option is to use a fixed flow threshold q0 (specific to each basin) for both the observed and the modeled flow time series. This allows comparison of flood frequency parameters between the historic and future periods, as well as between the observed and simulated flow time series. For example, the flow threshold can be selected based on the observed time series for the historic period. However, using a fixed threshold may result in either too small (with less than one occurrence per year) or too large a number of threshold-exceedance peak flows. The former may propagate into a high degree of uncertainty regarding flood frequency, whereas the latter may violate the assumption that peak flows are characterized by a GP distribution. For example, if the threshold is chosen based on the historic observed flow records, and is consequently used with flows generated by a model underestimating high flows, the number of threshold exceedances will be small; this may result in unduly large uncertainty in estimated extremes.
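A fixed average exceedance rate, rather than a fixed flow threshold, can be sketched as follows. This is a minimal illustration assuming the flood peaks have already been extracted and declustered; the function name and interface are hypothetical, not from the study:

```python
import numpy as np

def threshold_for_rate(peaks, n_years, rate=5.0):
    """Flow threshold q0 chosen so that, on average, `rate` independent
    peaks per year exceed it (the fixed-rate alternative to a fixed
    flow threshold described in the text)."""
    peaks = np.sort(np.asarray(peaks, dtype=float))[::-1]  # largest first
    n_exceed = int(round(rate * n_years))                  # target number of exceedances
    if n_exceed < 1 or n_exceed > peaks.size:
        raise ValueError("requested exceedance count is outside the data range")
    if n_exceed == peaks.size:
        return peaks[-1] - 1e-9                            # every peak exceeds
    # place the threshold midway between the n-th and (n+1)-th largest peaks
    return 0.5 * (peaks[n_exceed - 1] + peaks[n_exceed])
```

Applied separately to each observed or simulated series, this yields roughly 150 exceedances over a 30-year record at five events per year.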
Therefore, rather than using a fixed flow threshold, the present study employs a fixed average number of annual threshold exceedances (e.g. five exceedances per year, on average) to define individual flow thresholds for each time series [Coles, 2001, ch. 4; Madsen et al., 1997]. To allow the flood frequency parameter inter-comparison, the PDS/GP parameters are converted into the annual maximum series/generalized extreme value distribution (AMS/GEV) parameters: location parameter ξ, scale parameter α* and shape parameter κ (the same shape parameter as in the GP distribution) [Madsen et al., 1997]. This establishes a direct correspondence between the three independent PDS/GP parameters (q0 is defined by λ) and the three AMS/GEV parameters. Further particulars regarding the annual maximum distribution corresponding to the PDS/GP model are discussed in Appendix A.

The six parameters (three each from one historic and one future period) can be combined into a vector θ = (ln ξ_h, ln α*_h, κ_h, ln ξ_f, ln α*_f, κ_f). Here, the sub-script 'h' indicates the parameters of the historic period, while the sub-script 'f' indicates the parameters of the future period. Due to the subsequent reliance on Gaussian distributions, the descriptor is defined using log-transformed values of the location parameter ξ and scale parameter α*, both of which are strictly positive. The 'actual flood frequency' will be summarized using a descriptor θ0, and outputs from m models will be summarized using descriptors θi, i = 1, …, m. For ease of exposition, the notation used is consistent with that of Chandler [2013], and is summarized in Table 1.

The uncertainty in an MME can be described by considering the descriptors θi, i = 1, …, m to be drawn from some underlying probability distribution. Following Chandler …
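The PDS/GP to AMS/GEV conversion follows Madsen et al. [1997]: the shape parameter κ is shared between the two models, the GEV scale is α* = αλ^(−κ), and the GEV location is ξ = q0 + α(1 − λ^(−κ))/κ. A minimal sketch (the function name is hypothetical):

```python
import math

def pds_to_ams(lam, q0, alpha, kappa):
    """Convert PDS/GP parameters (Poisson rate lam, threshold q0, GP scale
    alpha, shape kappa) to AMS/GEV parameters (location xi, scale alpha_star,
    shape kappa).  The shape parameter is identical between the models."""
    if abs(kappa) < 1e-12:                       # Gumbel / exponential limit
        return q0 + alpha * math.log(lam), alpha, 0.0
    alpha_star = alpha * lam ** (-kappa)
    xi = q0 + alpha * (1.0 - lam ** (-kappa)) / kappa
    return xi, alpha_star, kappa
```

The conversion is exact: the annual-maximum distribution implied by the Poisson/GP model, exp(−λ[1 − κ(x − q0)/α]^(1/κ)), coincides with the GEV distribution evaluated at the converted parameters.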

Hierarchical model
A hierarchical Bayesian approach is employed to provide information regarding the actual hydrological response descriptor θ0, using information contained in the descriptor estimates θ̂i, i = 0, …, m [Chandler, 2013]. A set of assumptions regarding the distributions of various quantities in Figure 3 is required to specify the posterior distribution of θ0. The case when all of the required distributions are multivariate normal (Gaussian) is examined here, and Section 4 discusses the applicability/validity of the assumptions. This can be expressed as the following hierarchical model:

θ̂i | θi ~ N(θi, Σi), i = 0, 1, …, m (1)
θi | μ ~ N(μ, Wi), i = 1, …, m (2)
θ0 | μ ~ N(μ, Λ) (3)

where Σi denotes the covariance matrix of the ML estimate for the i-th data source, the matrix Wi measures the propensity of the i-th model to deviate from the model consensus μ, and Λ is the covariance matrix of the shared discrepancy δ. The discrepancies θi − μ, i = 1, …, m are assumed to be independent given μ (see Section 4). The assumption can be relaxed by extending the general framework in Figure 3 to include families of models, which are centered on their own consensuses, with those centered on the ensemble consensus. For a given catchment and a given hydrological model (but different climate variants), the descriptors derived from simulations are considered to be equally credible, and the corresponding covariance matrices Wi are set to be equal. Consequently, this results in either one or two unique covariance matrices Wi, when, correspondingly, one or two hydrological models are used to represent the catchment hydrological response. For simplicity, a case with one hydrological model is described below, i.e. W1 = W2 = ⋯ = Wm = W; similar derivations can be obtained for a case with two distinct matrices Wi. The hierarchical structure of the model is completed with prior distributions for μ, W, and Λ. This is done by choosing conjugate forms of these distributions, namely,

μ ~ N(η, σ²I) (4)
W⁻¹ ~ Wishart(ν, S_W) (5)
Λ⁻¹ ~ Wishart(ν, S_Λ) (6)

The marginal posterior for θ0 is sought given only the data θ̂i, i = 0, …, m.
The hyperparameters (parameters of a prior distribution) η, ν, σ, S_W and S_Λ are assumed to be known, and the covariance matrices Σi, i = 0, …, m are calculated (fixed) using the Fisher information matrix (Appendix A). However, only the 'historic' part of Σ0 can be defined in this way, due to future observations not being knowable. Therefore the 'future' components of Σ0 are considered as estimated with zero precision, giving a precision matrix Σ0⁻¹ whose future-period block is set to zero. For convenience, the fixed parameters are excluded from the notation while specifying the conditional distribution functions.

The full conditional distribution of the descriptors θ0, θ1, …, θm can be expressed as follows … (7). The first term is proportional to a product of normal distributions from (1), and the second term is provided by Chandler [2013] as … (8), where I is the 6 × 6 identity matrix and V = W + Λ. Further, combining the priors (5) and (6) with the likelihood for the random effects (2) and (3) produces the updated Wishart full conditional distributions (9) and (10) for the covariance matrices. Lastly, using (2) and (3) it can be shown [Lindley and Smith, 1972] that the full conditional of μ is … Sampling sequentially from the full conditional distributions (7)–(10) … As argued by Geyer [1992], only the number of iterations needed for the auto-covariances to decay to a negligible level is to be discarded, and less than 1% of the run will normally be sufficient for a burn-in period (with a sample size of the order of 10,000). Further, visual inspection shows that the trace plots of the iteration history for the elements of θ0 (the primary quantities of interest) have stabilized after the first 10,000 (discarded) iterations.
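The Gibbs structure can be illustrated with a deliberately simplified, one-dimensional analogue of the hierarchy in (1)–(3): scalar descriptors, a flat prior on the consensus, and the model-spread and shared-discrepancy variances held fixed rather than given Wishart updates. All names and numbers are illustrative only, not the paper's six-dimensional implementation:

```python
import numpy as np

def gibbs_combine(theta_hat, s2, w2, lam2, n_iter=20000, burn_in=1000, seed=0):
    """Toy scalar Gibbs sampler for the hierarchy
        theta_hat[i] | theta[i] ~ N(theta[i], s2[i]),  i = 0..m
        theta[i]     | mu       ~ N(mu, w2),           i = 1..m   (model spread)
        theta[0]     | mu       ~ N(mu, lam2)                     (shared discrepancy)
    with a flat prior on mu; w2 and lam2 are held fixed for brevity.
    Returns the post-burn-in draws of theta[0]."""
    rng = np.random.default_rng(seed)
    theta_hat = np.asarray(theta_hat, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    m = theta_hat.size - 1
    mu = theta_hat[1:].mean()
    theta = theta_hat.copy()
    draws = np.empty(n_iter)
    for it in range(n_iter):
        # simulators: precision-weighted compromise between data and consensus
        prec = 1.0 / s2[1:] + 1.0 / w2
        mean = (theta_hat[1:] / s2[1:] + mu / w2) / prec
        theta[1:] = rng.normal(mean, np.sqrt(1.0 / prec))
        # real system: observation vs consensus, with shared-discrepancy variance
        p0 = 1.0 / s2[0] + 1.0 / lam2
        m0 = (theta_hat[0] / s2[0] + mu / lam2) / p0
        theta[0] = rng.normal(m0, np.sqrt(1.0 / p0))
        # consensus: with a flat prior, a precision-weighted average of all levels
        pmu = m / w2 + 1.0 / lam2
        mmu = (theta[1:].sum() / w2 + theta[0] / lam2) / pmu
        mu = rng.normal(mmu, np.sqrt(1.0 / pmu))
        draws[it] = theta[0]
    return draws[burn_in:]
```

The posterior draws of theta[0] concentrate between the observational estimate and the ensemble consensus, with the burn-in portion discarded as discussed above.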

Selection of hyperparameters
The hyperparameter values are chosen to determine very vague priors (4)–(6), namely … = σ = 6. …

Appendix B suggests that the additional discrepancy in the future (the term in the discussion following equation B.2) is of the same magnitude as the historic discrepancy; this means that the future discrepancy could be as small as zero or as large as double the magnitude of the historical discrepancy. When significantly longer observation records (longer than the 30 years used in the analysis) are available, it is possible to use the remaining observations (i.e. the years that are not used in the estimation process) to specify the prior on the discrepancy spread for both historic and future periods. However, due to the limited time series of data availability in this study, as is typically the case, the choice of K is unavoidably subjective and there is no theoretical or empirical support for any specific non-zero value of K. We therefore adopted the standard strategy of specifying K = 0, and then subsequently quantifying the sensitivity of outputs to this arbitrary selection of its numerical value.

The hyperparameters S_W and S_Λ reflect the scale of elements in the covariance matrices W and Λ, respectively. While the posterior in (9) shows that there is substantial evidence available to update an estimate for W, there is only one data point available to update an estimate for Λ, as shown in the posterior (10). Hence, an estimator of Λ will be mainly based on the regionalized information used in the procedure.
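The role of K can be made concrete. Assuming, as in Appendix B, that the future shared discrepancy is δ_f = δ_h + ε, with ε independent of δ_h and cov(ε) = KΛ_h, the covariance of the stacked discrepancy has historic and cross blocks equal to Λ_h and a future diagonal block (1 + K)Λ_h. A sketch (the function name is hypothetical):

```python
import numpy as np

def shared_discrepancy_cov(lam_h, K):
    """Covariance of the stacked shared discrepancy (delta_h, delta_f)
    under delta_f = delta_h + eps, with eps independent of delta_h and
    cov(eps) = K * lam_h.  K = 0 reproduces the 'future equals historic'
    case; K = 1 doubles the future discrepancy variance."""
    lam_h = np.asarray(lam_h, dtype=float)
    top = np.hstack([lam_h, lam_h])                 # cov(d_h, d_h), cov(d_h, d_f)
    bottom = np.hstack([lam_h, (1.0 + K) * lam_h])  # cov(d_f, d_h), cov(d_f, d_f)
    return np.vstack([top, bottom])
```

Re-running the analysis with K = 0 and K = 1 brackets the sensitivity range quantified in the text.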

Applicability of the assumptions
The Gaussian specification in the previous sections relies on several assumptions. In this section we set out the assumptions and report the results of tests that have been carried out to assess their validity for the data used in this study. The assumptions (A0–A4) are as follows:

A0) The descriptors θi, i = 0, …, m summarize the flow time series for flood frequency analysis.

The descriptors θi, i = 0, …, m represent transformed PDS/GP parameters. For each catchment, the multiple hypotheses that the occurrence of peak flows follows a Poisson process and that exceedance flow magnitudes are drawn from a generalized Pareto distribution are tested for both observed and modeled flows, and for both historic and future time periods. The suitability of the Poisson assumption has been tested by means of the dispersion index [Cunnane, 1979], while the suitability of the generalized Pareto assumption has been tested by means of the Cramér-von Mises statistic [Choulakian and Stephens, 2001]. As this requires testing multiple catchment-specific hypotheses (46 or 90 hypotheses per catchment, depending on the number of models in the ensemble), the likelihood of a rare event increases, and therefore the likelihood of incorrectly rejecting a null hypothesis (a Type I error) becomes greater. The Holm-Bonferroni method [Holm, 1979] is used here to control the family-wise error rate (the probability of witnessing one or more Type I errors) by adjusting the rejection criteria of each of the individual hypotheses. None of the catchment-specific hypotheses can be rejected at the 0.05 family-wise error rate for 114 of the 135 catchments considered; 1, 2, 3, and 6 individual hypotheses (out of 46 or 90) are rejected for 5, 13, 2, and 1 catchments, correspondingly. These results are all consistent with what is expected if assumption A0 holds.

A1) The descriptor estimates θ̂i, i = 0, …, m are unbiased and Gaussian.
While the descriptor estimates are approximately multivariate normal for large samples [Millar, 2011, ch. 12.2], it can be shown (Appendix C) that this remains a reasonable approximation for the study sample size of 150 flow peaks. It is to be noted that, for a low shape parameter (e.g. κ = −1), the descriptor distributions have 'fatter' tails and sharper 'peaks' than the corresponding asymptotic multivariate normal distributions.

A2) The model discrepancies θi − μ, i = 1, …, m are Gaussian.

The use of simulator-specific covariance matrices Wi provides flexibility to accommodate outlying simulators via distributions that are highly dispersed rather than heavy-tailed. … cross-correlation is tested. It is found that the hypotheses cannot be rejected at a family-wise error rate of 0.05 [Holm, 1979].

A4) The shared discrepancy δ is Gaussian.

It is impossible to verify the appropriateness of this assumption, as a multi-model ensemble provides only a single realization of δ; the assumption is merely a convenient device to incorporate the shared discrepancy formally into the analysis [e.g. …].

… And while this study does not provide guidance on how to treat or resolve this sensitivity given the data limitations, it raises awareness that there is a significant problem that requires a solution. Figure 4b shows the maximum sizes (defined via the probability mass contained in the interval) of the posterior credible intervals that exclude zero change in Q100 (K=0), so that the larger the interval size, the stronger the support for a change in Q100. It shows, for example, that there are four river sites with zero change in Q100 outside of the posterior 95% credible intervals, and a further thirteen when the credible interval size is decreased to 90%.
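The Holm step-down adjustment used to control the family-wise error rate in these tests can be sketched as follows (a generic implementation, not code from the study):

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm's step-down procedure [Holm, 1979]: controls the probability
    of witnessing one or more Type I errors across a family of tests.
    Returns a list of booleans (True = reject) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # the k-th smallest p-value is compared against alpha / (m - k + 1)
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all remaining (larger) p-values are retained
    return reject
```

With 46 or 90 p-values per catchment, the smallest is compared against 0.05/46 (or 0.05/90), the next smallest against 0.05/45, and so on until the first failure.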
Further, Figure 5 shows the posterior 95% credible intervals for the relative change in Q100 with respect to Q100 estimated from the observed (historic) flow records, arranged in increasing order of the observation-based Q100 (Q100 is in mm/day); the first set of intervals corresponds to the site with the smallest observation-based Q100 estimate, and the last set corresponds to the site with the largest. A positive value of the relative change indicates a decrease in future Q100, and a negative value indicates an increase in future Q100. The figure shows 95% credible intervals for the relative change in Q100 given by all models in the ensemble, by the model that provides the median change in Q100, and by combining information with K=0 and K=1. The uncertainty in the combined estimator change for K=0 is generally smaller than the uncertainty in the all-model or median estimator changes, and is somewhat bigger than these for the sites with the largest Q100 estimates (mm/day). The latter is attributed to the models' under-predictive bias for the sites with large Q100, and is discussed later in the section. The comparison of the posterior credible intervals for K=0 and K=1 in Figure 5 … the discrepancy between the models and observations is much larger than for the case in Figure 5a, which inflates the uncertainty associated with the MME estimates (see the discussion for Figure 6). …

Table 2 shows the descriptor estimates for each data source for the catchment in Figure 6b (the catchment with a large shared model discrepancy). The MME strongly underestimates the scale parameter α* (log-transformed in the Table), underestimates the location parameter ξ, and overestimates the shape parameter κ, which collectively results in underestimation of T-year return events by the MME members (see A.3 and Figure 6b).
Table 2 … in historic Q100 estimated from the observations and the MME, and shows a relative uncertainty in the combined estimates with respect to the largest of the uncertainties in the observational and MME estimates. The uncertainty here is evaluated as the width of the 95% credible prediction intervals. Figure 7 demonstrates that combining information reduces uncertainty relative to the most uncertain information source, i.e. models or observations (shown by black squares), and that accounting appropriately for the model discrepancies (among other factors) does not result in a prohibitively large uncertainty, and thus may be valuable for practical applications. Improvement relative to the observation source can be anticipated from eq. (8), indicating that the combined estimate precision matrix is no less than the observation estimate precision matrix Σ0⁻¹ (a larger precision matrix corresponds to a better defined, or less uncertain, estimate). This relationship, however, does not guarantee that the combined estimate is better defined than an individual model estimate, due to the additional shared multi-model discrepancy (characterized by Λ). The combined estimate is superior to both observational and model estimates only when the model discrepancy is small. This is illustrated in Figure 7, which connects the relative uncertainty to the relative bias in the (median) ensemble estimate of Q100 with respect to observations (shown by grey stars), which serves as a proxy for the shared multi-model discrepancy. When Q100 is considerably underestimated by the ensemble (large negative values of relative bias), the MME provides over-confident estimates of Q100 due to low variability in simulated flow, and the combined uncertainty of Q100 is largely defined by the observational uncertainty of Q100 alone.
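The precision-pooling behaviour described here can be illustrated with a scalar caricature of eq. (8): the combined precision is the observational precision plus the precision of the ensemble pathway, which the shared discrepancy bounds from below. The function and its arguments are illustrative, not the paper's multivariate expression:

```python
def combined_variance(var_obs, var_ensemble, var_shared):
    """Scalar illustration: the precision (inverse variance) of the combined
    estimate is the observational precision plus the precision of the
    ensemble route, which is inflated by the shared-discrepancy variance."""
    precision = 1.0 / var_obs + 1.0 / (var_ensemble + var_shared)
    return 1.0 / precision
```

The combined variance never exceeds the observational variance, but it beats an individual model only when the shared discrepancy is small, mirroring the pattern in Figure 7.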
It is well-established that the magnitude of the 100-year flood is highly uncertain when estimated using only a 30-year period of data [Hosking and Wallis, 1987]. It is, however, problematic to use longer periods of data due to the heightened risk of inadvertently capturing time trends in flood frequency. …

This information-combining method hinges on a number of assumptions, and is subject to various limitations, as described in Section 3. The proposed method is general, however, allowing the assumptions to be readily and transparently altered to reflect other beliefs and to evaluate the sensitivity of outcomes. Representation of future model discrepancy is one of the most impactful, but (based on the data available) not a priori knowable, assumptions (see the discussion in Section 3.2.3). Although its explicit representation, and its differentiation between the past and the future, may increase uncertainty, the technique represents an advance in MME processing, where it is typically (and erroneously) assumed that model performance is similar between the past and the future. The latter is comparable to the assumption that the characteristics of the model discrepancy remain the same between the two periods. When applied to a flood frequency change problem, the information-combining method exhibits the following attractive properties:

1) It provides 'baseline' conditions that are consistent with the observed evidence, and therefore are appropriate for comparison with future changes.
2) The method reduces uncertainty in flood estimates, as compared to the estimates of uncertainty derived from both the MME and observations.
3) For low values of shared model discrepancy, the approach allows extraction of information from multiple small and weak effects by bringing together predictions that individually lack statistical power, but collectively provide meaningful information.
In the AMS/GEV approach, annual peak flow maxima are assumed to follow a generalized extreme value distribution [Madsen et al., 1997] with the following cumulative distribution:

F(x) = exp{ −[1 − κ(x − ξ)/α*]^(1/κ) } (A.1)

The three parameters in (A.1) (ξ, α*, and κ) are related to the PDS/GP model parameters λ, q0, α and κ, such that the shape parameter κ is identical between the models, and

ξ = q0 + α(1 − λ^(−κ))/κ,  α* = α λ^(−κ)

The mean annual number of threshold exceedances λ defines the flow threshold q0 in PDS/GP. In the UK, five events per year, on average, are included, with standard rules employed to ensure that extracted flood peaks are independent events (by imposing a minimum separation time period of three times the time-to-peak, and specifying that the flow between two peaks must drop to at least two thirds of the higher peak [Robson, 1999, p. 276]).

While no explicit solution exists for the maximum likelihood estimator of ξ, α* and κ, a numerical procedure is typically applied based on Newton-Raphson iteration [Hosking and Wallis, 1987; Madsen et al., 1997]. The asymptotic covariance matrix for the ML estimator is obtained by inverting the Fisher information matrix given by Prescott and Walden [1983].

The T-year return period event magnitude for AMS is defined as the 1 − 1/T quantile of the GEV distribution (A.1), so that

z_T = ξ + α*(1 − exp(−κ y_T))/κ, where y_T = −ln(−ln(1 − 1/T)) (A.3)

is the Gumbel reduced variate. The quantile estimator ẑ_T is defined by substituting the estimators of ξ, α* and κ in the above equation. …

The propensity to deviate around the model consensus is described by the covariance matrix W and can be estimated as

Ŵ = (1/(m − 1)) Σi (θ̂i − θ̄)(θ̂i − θ̄)ᵀ

where m represents the number of models considered to be equally credible and θ̄ = (1/m) Σi θ̂i represents the mean of the estimated model descriptors.
Λ_h is the part of the covariance matrix Λ corresponding to the historic period, and can be … As no data are available to estimate the remainder of the matrix Λ, the assumption about the change of the MME shared discrepancy in the future, characterized by a constant K > 0, is used to provide an estimate of Λ as … The assumption is that the shared discrepancy would change in the future as δ_f = δ_h + ε, where the random variable ε is independent of δ_h, has a mean of zero, and has a covariance matrix KΛ_h for some K … [Mardia, 1970] … be generalized for a case with distinct matrices Wi. The model discrepancies are assumed to be independent, and cross-covariances are considered below to establish a necessary condition for the assumption. When model discrepancies come from the same distribution for all catchments (i.e. W and Λ are constant for all catchments), the sample cross-covariance based on n draws … can be computed and the significance hypotheses can be tested. In a general case, the n draws cannot be used …

Figure captions (fragments):

Basins with the posterior 95% credible intervals that exclude zero change in Q100 are denoted by black circles (K=0). The thick vertical red lines denote the median estimate and the thin red lines denote the corresponding 95% credible intervals derived from the historic observations. Black crosses indicate the median estimate and grey bars indicate the corresponding 95% credible intervals derived from the model predictions.
The blue dot is the median estimate and the blue (cyan) lines are the corresponding 95% credible intervals derived from combined information when K = 0 (K = 1); the blue and cyan lines overlap for the historic period. The magenta star is the model estimate corresponding to the median change in return period.
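As a worked example of the Appendix A quantile: the T-year event magnitude is the 1 − 1/T quantile of the GEV distribution, obtained via the Gumbel reduced variate. A minimal sketch in the Hosking shape convention used with κ above (the function name is hypothetical):

```python
import math

def return_level(xi, alpha_star, kappa, T):
    """T-year return level: the (1 - 1/T) quantile of the GEV distribution
    F(x) = exp(-[1 - kappa*(x - xi)/alpha_star]**(1/kappa)),
    with kappa -> 0 giving the Gumbel case."""
    if T <= 1:
        raise ValueError("T must exceed one year")
    y = -math.log(-math.log(1.0 - 1.0 / T))   # Gumbel reduced variate y_T
    if abs(kappa) < 1e-12:
        return xi + alpha_star * y
    return xi + alpha_star * (1.0 - math.exp(-kappa * y)) / kappa
```

For example, Q100 corresponds to T = 100; inverting the GEV cumulative distribution at the returned level recovers a non-exceedance probability of 0.99.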