Stress‐Dependent Magnitudes of Induced Earthquakes in the Groningen Gas Field

Geological faults may produce earthquakes under induced stresses associated with hydrocarbon extraction, geothermal extraction, or CO2 storage. The associated risks depend on the frequency and magnitude of these earthquakes. Within seismic risk analysis, the exceedance probability of seismic moments, ℳ , is treated as a pure power law distribution, ∼ℳ−β , where the power law exponent, β, may vary in time or space or with stress. Insights from statistical mechanics theories of brittle failure, statistical seismology, and acoustic emissions experiments all indicate this pure power law may contain an exponential taper, ∼ℳ−βe−ζℳ , where the taper strength, ζ, decreases with increasing stress. The role of this taper is to significantly reduce the probability of earthquakes larger than ζ −1 relative to the pure power law. We review the existing theoretical and observational evidence for a stress‐dependent exponential taper to motivate a range of magnitude models suitable for induced seismicity risk analysis. These include stress‐invariant models with and without a taper, stress‐dependent β models without a taper, and stress‐dependent ζ models. For each of these models, we evaluated their forecast performance within the Groningen gas field in the Netherlands using a combination of Bayesian inference, and simulations. Our results show that the stress‐dependent ζ model with constant β likely offers (75–85%) higher performance forecasts than the stress‐dependent β models with ζ=0 . This model also lowers the magnitudes with a 10% and 1% chance of exceedance over the next 5 years of gas production from 4.3 to 3.7 and from 5.5 to 4.3, respectively.


Introduction
Induced seismicity may arise due to mining, geothermal energy production, artificial lakes, and fluid injection or production, including hydrocarbon production, water disposal or CO 2 storage. Most of these activities occur without inducing any noticeable earthquakes. Nonetheless, due to the quantity and scale of these activities, there is a growing number of notable occurrences of induced earthquakes. Several recent reviews comprehensively summarize the worldwide evidence for seismicity induced by human activities (Davies et al., 2013;Ellsworth, 2013;Evans et al., 2012;Foulger et al., 2018;IEAGHG, 2013;Klose, 2013;Majer et al., 2007;NAS, 2013;Suckale, 2009).
In such cases of induced seismicity, any exposure to the associated hazards of seismic ground motions or to the risks of building damage must be assessed using probabilistic seismic hazard and risk analysis (e.g., van Elk et al., 2019), and if necessary mitigated. Induced seismicity is a transient, nonstationary process caused by time-varying increases in stress that are sufficient to destabilize previously inactive faults.
Forecasting such failures within a geological material critically depends on its heterogeneity (Vasseur et al., 2015). We will divide geological heterogeneity into two classes. First, resolvable heterogeneity that may be mapped and accounted for explicitly with deterministic models such as the large-scale geometries of geological faults and reservoirs that may be mapped by reflection seismic imaging. Second, unresolved heterogeneity, such as small-scale, spatial variations in geometric, poroelastic, frictional or prestress properties, that will influence frictional fault failures in ways that may only be fully characterized with statistical models. The space-time evolution of induced seismicity occurrences, magnitudes, and their variability within a region exposed to increasing stress loads will depend on the relative amounts of geological variability present within these two classes of heterogeneity: resolvable (ordered) and unresolved (disordered) variability.
In the limit prestress is the dominant source of disorder and that its variability significantly exceeds both the induced stress loads and earthquake stress transfers, then the initial occurrences of any induced seismicity will be governed by the probability distribution of extreme prestress values. This approach led to the development of statistical models of induced seismicity occurrence based on Extreme Threshold Theory where the resolvable heterogeneities are included in a deterministic poroelastic thin-sheet stress model and the unresolved heterogeneities are represented by the upper tail of a prestress probability distribution given by the universal form of a Generalized Pareto distribution (Bourne & Oates, 2017b). This simple model explains the observed, nonstationary, space-time statistics of induced seismicity within the Groningen gas field and provides a physical explanation for the exponential-like and near-instantaneous increase in seismicity rates relative to induced stress rates (Bourne et al., 2018). In this limit of strong prestress disorder, the probability distribution of extreme prestress values that influences the initiation of earthquakes will also influence the arrest of seismic slip, and therefore also the probability distribution of induced earthquake magnitudes.
Current methods of forecasting induced earthquake magnitudes are empirical and lack a clear physical basis. Natural and induced seismicity hazard analysis for the United States assumes a stationary process with a stress-invariant pure power law distribution of seismic moments (Petersen et al., 2018). (Shapiro et al., 2010a) proposed a nonstationary model for fluid injection-induced seismicity that includes a prestress disorder with a uniform distribution to model event occurrence but assumes a stress-invariant pure power law distribution of seismic moments (Langenbruch & Zoback, 2016;Shapiro, 2018;Shapiro et al., 2010b). Hazard analyses for Groningen-induced seismicity included a stress-dependent, pure power law distribution of seismic moments where the power law exponent varied with reservoir compaction (Bourne et al., 2014) or induced Coulomb stress (Bourne et al., 2018).
This study seeks to extend the method of treating unresolved heterogeneity as stochastic disorder to improve the seismological models used for forecasting induced earthquakes magnitudes for the purpose of seismic hazard and risk analysis. We will build on previous work to incorporate the failure mechanics of disordered media into a statistical mechanics theory of natural earthquakes (e.g., Alava et al., 2006;Bak & Tang, 1989;de Arcangelis et al., 2016), and their seismic hazard analysis (Main, 1996). These statistical mechanics theories will be used to motivate the choice of models to be evaluate but not used to rank or select them, which will instead be based on performance measures using the observed Groningen seismicity. Under many different theories the probability distribution of failure event sizes follows a power law subject to an exponential taper where the power law exponent is stress invariant while the characteristic taper scale increases as a critical point power law with stress. However, under some other circumstances the power law exponent may exhibit variation with stress. We account for these epistemic uncertainties by evaluating five alternative classes of frequency-moment models for induced earthquakes: 1. Stress-invariant power law with no taper 2. Stress-invariant power law with a stress-invariant taper 3. Stress-dependent power law with no taper 4. Stress-invariant power law with a stress-dependent taper 5. Stress-dependent power law with a stress-dependent taper Using Bayesian inference, we sample the full posterior distribution of possible models given the observed history of induced seismicity and inferred stress history within the Groningen gas field for a range of alternative parametrization choices within each of the five model classes. Our evaluation of the Groningen pseudo-prospective forecast performance for induced earthquake magnitudes reveals the best performing models all include a stress-dependent taper as anticipated by most statistical mechanics theories of brittle fracture.
After briefly stating the standard power law formulation of seismic moments in statistical seismology (section 2), we will summarize the seismological literature that proposes (section 3) or opposes (section 4) evidence for stress-dependent variations of power law exponent with stress. We will then describe our model of intrareservoir-induced stress due to pore pressure changes (section 5) followed by simple statistical analyses of the variations in observed earthquake magnitudes induced by Groningen gas production (section 6). Then after reviewing existing statistical mechanics theories of earthquakes (section 7), we specify our models for the stress dependence of induced earthquake magnitude distributions (section 10), infer their parameter values (section 11), and evaluate their performance (section 12), before assessing their implications for seismic activity rates (supporting information) and seismic hazard (sections 13 and 14).

Power Law Distribution of Seismic Moments
The exceedance probability distribution of earthquake magnitudes typically takes the following form: where M is the earthquake moment magnitude conditional on M ≥ M m and b defines the negative slope of the exponential distribution (Gutenberg & Richter, 1954). Alternatively, this may be expressed according to the seismic moment, M, which scales with moment magnitude as log 10 M ¼ ðc þ dMÞ; (2) with c ¼ 9:1 and d ¼ 1:5. Combining 1 and 2 leads to the equivalent power law distribution, and b ¼ βd. Seismic hazard and risk analysis is highly influenced by the estimation of β values as lower β values mean larger expected magnitudes. In the next two sections we outline the existing evidence for two alternate hypotheses about the influence of stress on β values.

The β Values Vary With Stress
A number of observations and modeling results suggest that b value depends on the stress level. Measured natural earthquake b values decrease systematically from 1.2 to 0.8 with increasing depths in the brittle crust from 5 to 15 km (Mori & Abercrombie, 1997;Spada et al., 2013). Similar measurements indicate earthquake b values vary systematically with focal mechanism rake angle as a proxy for stress (Gulia & Wiemer, 2010;Schorlemmer et al., 2005). Lower stress, normal faulting b values are typically 1.0-1.2. Whereas higher stress, thrust faulting b values are typically 0.7-0.9. Intermediate stress, strike-slip b values are in the range 0.9-1.0 (Huang et al., 2018;Wiemer & Wyss, 1997, 2002. Variations in measured b values may be a proxy for shear stress and pore pressure variations (Bachmann et al., 2012;Bourne et al., 2014Bourne et al., , 2018Scholz, 1968;Schorlemmer et al., 2005), or for material heterogeneity (Main et al., 1992;Mogi, 1962;Mori & Abercrombie, 1997) or for fault asperity mapping (Tormann et al., 2014). The scale of fault heterogeneity appears to follow a power law where its fractal dimension governs the b value of seismic slip events within this fault population (Main et al., 1989(Main et al., , 1990(Main et al., , 1992. Initial heterogeneities in the form of a fractal distribution of fault sizes or fault asperities are one way to explain the Gutenberg-Richter law. Another explanation is that it arises from some distribution of strength or stress heterogeneity (Langenbruch & Shapiro, 2014). Variations in observed b values may also be precursors of future rupture areas and sizes (Schorlemmer et al., 2005). In this case, b values appear to decrease monotonically throughout the precursory phase and then recover abruptly after peak stress (marked by a sudden stress drop event). Scholz (1968) introduced a statistical model of brittle failure within an heterogeneous elastic medium to explain the apparent decrease in b values with increasing stress.

The β Values Do Not Vary With Stress
In this hypothesis all observed b values are considered consistent with a constant value in both space and time and any observed apparent variations are associated with artifacts due to undersampling (detection threshold and finite sample size effects), magnitude errors, nonhomogeneous detection capabilities, and improper statistical tests (Amitrano, 2012;Amorèse et al., 2010;Frohlich & Davis, 1993;Kagan, 1999Kagan, , 2002bKagan, , 2010Kamer & Hiemer, 2015;Shi & Bolt, 1982). For example, observed variations in b values with stress rely on the maximum likelihood estimator (Aki, 1965), with corrections for the magnitude binning (Bender, 1983;Tinti & Mulargia, 1987;Utsu, 1965). This method implicitly assumes that the underlying distribution is a pure power law above some threshold of completeness according to Equation 3. If this is not the case, then this estimator will be biased and confounded with any non-power-law stress-dependent variations in the frequency-magnitude distribution, as we will show later. Recent developments in statistical fracture and earthquake mechanics theories for highly disordered media indicate failure sizes are distributed as a stress-invariant power law with a stress-dependent exponential taper.

Poroelastic Thin-Sheet Stress Model
The development of induced stress loads on preexisting weak fault structures within the Groningen gas field depends on the evolution of reservoir deformations induced by pore pressure changes. Within the limit of small incremental strains, these reservoir deformations are well described by linear poroelasticity. For thin reservoir geometries where the lateral extent of the reservoir greatly exceeds its thickness, the reservoir deforms as a thin sheet. Within the poroelastic, thin-sheet approximation, depletioninduced reservoir displacement vector field, u(x), is limited by symmetry to vertical displacements, uðxÞẑ, where x is the position vector andẑ is the unit vertical vector. Under these conditions the vertically averaged, maximum, incremental Coulomb reservoir stress states, ΔC, are (Bourne & Oates, 2017b, Equation 53): where t denotes time, μ is the coefficient of fault friction, ΔP is the vertically averaged change in reservoir pore fluid pressure, ϵ zz is the vertically averaged reservoir compaction strain, Γ(x) is the magnitude of the horizontal gradient vector in the elevation of the top surface bounding the thin sheet, γ ¼ ν=ð1 − 2νÞ where ν is the drained Poisson ratio taken to be 0.25, and H is a poroelastic modulus defined as H s is a constant related to the shear modulus of the skeleton material comprising the poroelastic medium and estimated as a model parameter and H r ðxÞ ¼ ΔPðx; tÞ=ϵ zz ðx; tÞ is the local, time-invariant ratio of the observed reservoir depletion to the observed reservoir compaction strain. Reservoir compaction strain, ϵ zz (x, t), is inferred from geodetic monitoring of surface displacements, reservoir depletion, ΔP(x, t), is measured by in-well pressure gauges, and Γ(x) is computed from the top reservoir surface mapped by reflection seismic imaging.
In the limit of large topographic gradients, Γ ≫ 1 > μ, as typically associated with preexisting fault offsets, the thin-sheet approximation for incremental maximum Coulomb reservoir stress reduces to where α is Biot's coefficient such that α ¼ H=H r . In this limit, ΔC is independent of fault friction, μ, as large topographic gradients amplify the induced stresses without affecting the changes in normal stresses. In general, the areal distribution of vertical strains, ϵ zz (x) will not conform to the areal distribution of pore pressure changes, ΔP(x) due to lateral variations in the poroelastic modulus, H(x). Depending on the poroelastic constant, H s , the ΔC stress field will tend to conform to either the incremental reservoir strains, ϵ zz , or the pore pressure changes, ΔP, both modulated by the areal distribution of topographic gradients, Γ(x): In this way, H s governs the influence of the resolvable elastic heterogeneity field, as measured by H r (x), on the induced stress field. Smaller H s values increase the influence of elastic heterogeneity until the induced stress is entirely driven by the incremental reservoir strains, ϵ zz (x). In contrast, larger H s values reduce the influence of elastic heterogeneity until induced stresses are driven by incremental reservoir pore pressures, ΔP(x). Since H s is not a directly observable quantity, we infer its distribution of possible values as part of the overall model inference given the observed seismicity.
Two modifications help to improve the performance of this thin-sheet stress model for induced seismicity forecasting; one to exclude stresses associated with aseismic fault slip, and another to exclude extraneous details in lateral stress variability. First, we filter the contribution of individual faults to the topographic gradient field, Γ(x), according to their juxtaposition geometry with the reservoir, by including fault segments according to the criterion: where r is the local ratio of fault throw to reservoir thickness and r max is a model parameter. This represents the consequences of juxtaposition, where faults that significantly offset the reservoir against the overlying and ductile Zechstein salt formation may be excluded as sources of induced seismicity.
where G(x, x′) is the isotropic Gaussian kernel: This poroelastic, thin-sheet, reservoir stress model has three degrees of freedom {σ, r max , H s }; the smoothing length scale, σ, the maximum reservoir fault juxtaposition ratio, r max , and the poroelasticity constant H s . These control the relative influence of elastic heterogeneity (H s ), fault-controlled geometric heterogeneity (r max ) and the resolution length scale (σ) of the stress field induced by the observable reservoir strains and pore pressure changes. These three parameters are optimized jointly with a given seismological model that defines the conditional probabilities of earthquake occurrence and size given the smoothed incremental Coulomb stress field, ΔCðx; tÞ, and the observed catalog of induced earthquakes, D. In the particular case of the Groningen reservoir, the observed history of pore pressure depletion, reservoir compaction strains, and induced earthquake epicenters yield a posterior distribution of thin-sheet stress models that explains the spatial-temporal distribution of seismicity induced by gas production (Bourne & Oates, 2017b) and its response to control measures that reduced and redistributed gas production rates (Bourne et al., 2018) (Appendix A).

Observed Deformation and Seismicity
Natural gas production from the Groningen region in the northeast of the Netherlands began in 1963 and has now produced over 70% of the estimated 2,800 × 10 9 m 3 initial gas in place. This has steadily reduced the initial mean pore pressure by up to 25 MPa within the gas-bearing sediments of the Upper Rotliegend (Permian) and Limburg (Carboniferous) Groups (Stauble & Milius, 1970) that are located 2,600-3,200 m below the surface. The resulting reservoir compaction has steadily increasing surface subsidence up to 400 mm as observed by optical leveling surveys since 1964 (Bourne et al., 2014;Smith et al., 2019) and InSAR since 1995 (Ketelaar, 2009(Ketelaar, , 2008. These reservoir deformations reactivate intrareservoir faults that have been inducing observable seismicity since at least 1991 (Dost et al., 2012) with an exponential-like increase in the cumulative number of earthquakes relative to the cumulative volume of gas produced (Bourne & Oates, 2017b;Bourne et al., 2018). The earthquake catalog has a magnitude of completeness for located events M L ¼ 1:5, since April 1995, with an event detection threshold of M L ¼ 1:0 and typical epicenter and magnitude errors of 500-1,000 m, and 0.1, respectively (see Dost et al., 2012Dost et al., , 2017. Event depths (Smith et al., 2020) and focal mechanisms (Willacy et al., 2019) are predominately consistent with dip-slip on preexisting normal faults within or close to the compacting reservoir. The spatial distribution of epicenters is localized within regions of the reservoir associated with larger incremental Coulomb stresses as represented by the poroelastic thin-sheet model ( Figure 1). Event origin times also appear to favor larger incremental Coulomb stress states ( Figure 2a) as most events occur at later times and in places where incremental stresses are larger albeit subject to considerable variability. Likewise, larger magnitude events, for example, M ≥ 2.5, appear mostly localized in the times and places associated with the largest 20% of the exposed reservoir stress states. The stress dependence of event occurrence probability appears to follow an exponential-like trend consistent with progressive frictional reactivation within a heterogeneous and disordered fault system (Bourne & Oates, 2017b). The observed frequency-magnitude distribution of events ( Figure 2b) shows clear evidence for underreporting of M < 1.5 events and an apparent increase in count variability with increasing magnitude due to finite sample effects.

Frequency-Magnitude Stress Dependence
To investigate stress dependence of the frequency-magnitude distribution without making any assumptions about the particular form of this distribution we will use the Kolmogorov-Smirnov test statistic as previously applied to these data by Harris and Bourne (2017). First, we compute the incremental Coulomb stress, ΔC i , at the origin time, t i and epicentral location, x i of each observed M ≥ 1.5 event from 1995 to 2019, according to the poroelastic thin-sheet reservoir model. Based on these values, we divide the events into two disjoint samples: a low-stress sample, ΔC i < ΔC, and a high-stress sample, ΔC i ≥ ΔC. By increasing the stress threshold, ΔC, we compute the Kolmogorov-Smirnov test statistic p value for all possible divisions of the events ( Figure 3a) and repeat this procedure for alternative minimum magnitudes of completeness in the range 1 ≤ M min ≤ 2 ( Figure 3b).
The smallest p values found are about 10 −3 and correspond to M min ¼ 1:5, and ΔC ¼ 0:63 MPa stress threshold, with about 100 and 200 events in the low-and high-stress samples, respectively. This result appears robust to alternative values of M min such that the 95% confidence threshold is exceeded also for 1.0 ≤ M min ≤ 1.7. For M min > 1.7, the loss of statistical power due to the smaller number of these events likely predominates. Consequently, we conclude there is a statistically significant stress dependence in the frequency-magnitude distribution of Groningen-induced earthquakes. Figures 3c and 3d show the empirical exceedance distribution functions and epicentral map locations for this optimal stress-based division of the observed events.
By simple visual inspection, the different distributions appear consistent with β values decreasing with stress or ζ values increasing with stress. Ergodicity is implicit within this stress covariate hypothesis. That is to say a temporal stress change is indistinguishable from a spatial stress change of the same amount. The separation of high-and low-stress events in space ( Figure 3d) more than in time ( Figure 2a) might indicate the influence of some initial spatial heterogeneity (quenched disorder). However, closer inspection of the map shows spatial mixing with many low-and high-stress events occurring in similar locations. This means there are three distinct spatial domains. A low-stress domain that has never experienced incremental stress above the 0.63 MPa threshold over the period of observation. A high-stress domain that has never experienced incremental stress below the 0.63 MPa threshold over the period of observation. Finally, an intermediate stress domain that has experienced stress states that have crossed the 0.63 MPa threshold at some time during the period of observation.  Gray lines indicate the evolution of stress exposure within the reservoir according to the poroelastic thin-sheet model and denote the reservoir volume fraction exposed to at most that stress state. Most events occur within the largest 20% of the exposed stress states (80-100%). (b) The frequency-magnitude distribution of earthquakes from 1 January 1995 to 1 June 2019 associated with Groningen gas production represented by exceedance counts and event counts within magnitude bins of size 0.1. The magnitude of completeness for this catalog is in the range 1.3-1.5. Any continuous stress dependence of the frequency-magnitude distribution implies both samples still represent a mixture of different distributions reflecting the range of stress states within each sample. In this case, subdivision of the events into more than two disjoint samples fails to reveal any reliable evidence for this which we attribute to the reduction of statistical power, which limits our resolution of this stress dependency under the Kolmogorov-Smirnov nonparametric test statistic.

Statistical Mechanics of Earthquakes
We will now briefly review the statistical mechanics aspects of earthquakes that motivate our choice of possible models that are included in the evaluation. We will use these theories for hypothesis identification and not for hypothesis testing, which we will do using the available observations of Groningen-induced seismicity.
Forecasting failure events within geological materials critically depends on heterogeneity (Vasseur et al., 2015(Vasseur et al., , 2017. Statistical models of failure distinguish themselves from deterministic models by incorporating unresolvable heterogeneities as stochastic disorder. These methods originated with (Weibull, 1939) and now fall within a broad class of statistical models of fractures (e.g., Alava et al., 2006) and earthquakes (de Arcangelis et al., 2016). Within these theories, the frequency-moment power law may be derived in one of at least five different ways.
Likewise, the frequency-moment distribution as a power law with an exponential taper also has a physical basis in at least five different statistical mechanics theories.
1. Within fiber bundle models of brittle failure with equal-load sharing (e.g., Pradhan, 2010); 2. within percolation theory below the percolation threshold (e.g., Stauffer & Aharony, 1994); 3. within Ising models of brittle failure with local-load sharing; 4. within interface theories of crack depinning in the presence of heterogeneity (e.g., Daguier et al., 1997); and 5. within information theory, using the concept of maximum entropy to find the least informative probability distribution subject to observational constraints on the mean magnitude and mean total seismic moment rate (Main & Burton, 1984).
For earthquakes, we will consider the limit where stress redistribution associated with failure events is small relative to the external stress known as damage mechanics. Damage mechanics models exist in two distinct classes. First, network models that address the evolution of failure across a distributed collection of interacting elements. Second, interface models that focus on the advance of a fracture tip line within a heterogeneous medium. These damage models all exhibit power law failure size distributions with an exponential-like stress-dependent taper.
Network damage models take three key forms with respect to failures. Random fuse networks (de Arcangelis et al., 2007;Hansen, 2011;Roux et al., 1988), provide a model of brittle failure within a scalar central force network (Gilabert et al., 2007). Each fuse within the network has a randomly assigned and invariant failure threshold (quenched disorder). Increasing external voltage leads to failure of individual fuses and redistribution of current across the network that potentially triggers additional failures at constant applied voltage. Mean field theory (Toussaint & Hansen, 2006) shows this is a percolation process in the limit of infinite disorder (Roux et al., 1988) where redistributed loads are equally shared. Random spring networks (Nukala et al., 2005) provide a model of brittle failure within a tensor central force network. Here, spring failure under a quenched random strain threshold and forces are redistributed across the remaining spring network. Under simple shear loads, failure within this network is equivalent to random fuse networks. Random block-spring networks (Burridge & Knopoff, 1967) represent frictional failures within a tensor central force network. A network of slider-blocks in frictional contact with a rigid basal surface are connected to each other and to a driver plate by a network of springs. Displacement of the driver plate loads the blocks which slide when the basal shear exceeds the frictional threshold. Basal shear stresses are initiated as a random quenched disorder. Within mean field theory (Sornette & Physique, 1992), the first cycle of failures is equivalent to the fiber bundle model (Hansen & Hemmer, 1994;Hemmer & Hansen, 1992;Kloster et al., 1997;Pradhan, 2010). Toussaint and Pride (2005) demonstrate an isomorphism of the weak lattice damage model with the fiber bundle model, which in turn is isomorphic with percolation theory for equal-load sharing or the Ising model for local load sharing. Using renormalization group theory, Shekhawat et al. (2013) unified the theories of fracturing within a disordered brittle material for infinite disorder (percolation) and zero disorder (nucleation) to show a power law failure avalanche size distribution with an exponential-like taper for finite disorder. Also using renormalization group theory, Coniglio and Klein (1980) demonstrate a correspondence between percolation and Ising models.
Alternatively, interface damage models represent an existing crack front as a deformable line that advances under an external stress through a random toughness medium (e.g., Daguier et al., 1997). This crack front advances episodically between equilibrium states in which heterogeneities temporarily resist crack propagation. The resulting size of crack growth events depends on the competition between distortions of the crack front due to the material's inhomogeneities and the elastic self-stress field that acts to straighten this front (Bonamy & Bouchaud, 2011). Within elastic fracture mechanics theory and in the limit of quasi-static deformations, this crack depinning process leads to failure sizes distributed as a universal power law with a stress-dependent exponential taper (Ponson et al., 2006). This observation that so many diverse statistical failure mechanics models all collapse to the same failure size distribution is remarkable and motivates the application of statistical mechanics to seismic hazard analysis (Main, 1996). In the limit that random prestress variability significantly exceeds both the induced stress loads and earthquake stress transfers, then the frequency distribution of induced earthquake magnitudes may be described by mean field theories within statistical fracture mechanics. These phenomena are not limited to geological materials. A wide variety of physical systems exhibit crackling noise when driven toward failure slowly (Sethna et al., 2001) and the event size distributions are power laws with exponential-like tapers. Also, with regard to fitting observed global natural seismicity, Kagan (2002b) strongly favors a power law with an exponential taper and a universal stress-invariant β value (Kagan, 2002a).

A Generalized Frequency-Moment Distribution
The existing evidence for failure sizes distributed as a power law with a stress-dependent exponential taper within a wide variety of statistical mechanical systems, motivates us to test the hypothesis that this is a useful model of induced seismicity observed within the Groningen region. Following the common form of failure size distributions found within a wide range of statistical mechanics models of brittle failure, we follow Kagan (2002b) and write a generalized distribution for earthquakes according to the seismic moment, M, exceedance probability (survival) function: where M m is the lower threshold for completeness in the observed catalog and β is the power law exponent and ζ is the exponential taper exponent. The corner moment, M c , follows as M c ¼ M m =ζ . We describe this model with ζ rather than M c to better support model inference, as pure power laws correspond to the finite value ζ ¼ 0, rather an infinite corner moment. As expected, for M ¼ M m the exceedance probability is 1.
Within these statistical mechanics models of a highly disordered fault or fracture system being driven by an increasing external load, σ, from stability toward critical instability, β is a universal constant and M c evolves as a power law relative to the system's critical point, σ c , such that Figure 4 illustrates how this survival function evolves with increasing ζ. The maximum likelihood estimator (Aki, 1965), with corrections for the magnitude binning (Bender, 1983;Tinti & Mulargia, 1987;Utsu, 1965) assumes ζ ¼ 0. If this is not true, the estimator becomes biased upward. Figure 5 illustrates this bias using magnitudes simulated according to 11. When ζ scales as a critical point function of external stress then this bias appears as a systematic, nonlinear, and artificial decrease in b values.
The probability density of the power law seismic moment model with an exponential taper may be obtained by differentiating 11 to obtain and the log-likelihood of this model given a set of seismic moment observations, M i ¼ fM 1 ; …; M n g, follows as as previously given by Kagan (2002b). If the observed seismic moments, M i , are computed from moment magnitudes according to 2 and these magnitudes are binned within intervals of size, ΔM, then the minimum seismic moment, M m , must be computed as where M c is the magnitude of completeness above which all events within the region of interest are reliably detected and located. We will use this one general form of the log-likelihood function for the inference and evaluation of all the different possible earthquake magnitude models considered in this study. All that remains now is to specify the functional form of any magnitude stress dependence according to β ¼ βðΔCÞ and ζ ¼ ζ ðΔCÞ.

Apparent Stress Dependence of β and ζ Values
The significant stress-dependent differences found in the observed frequency-magnitude distribution may reflect a decrease in the power law exponent, β, or the exponential taper exponent, ζ, with increased Coulomb stress. To measure any apparent variations of β or ζ values with Coulomb stress, we first ordered the M ≥ 1.5 observed events from 1 April 1995 to 1 June 2019 according to the incremental maximum Coulomb stress at their time and place of occurrence within the poroelastic thin-sheet reservoir   Figure 6a shows the resulting β value estimates and their uncertainties for k ¼ 20 which tend to decrease with increasing Coulomb stress. Within a gradually decreasing trend, an apparent small step-like decrease is evident at ΔC ¼ 0:63 MPa, which is consistent with the previous Kolmogorov-Smirnoff test (Figure 3a), and may indicate mode switching. A gradual evolution of b values due to a mixing of different magnitude distributions associated with different stress states has recently been demonstrated in laboratory data (Jiang et al., 2017) and also observed in volcanic seismicity (Roberts et al., 2016).
Repeating this procedure for a constant ζ value model with β fixed at its presumed universal value ðβ ¼ 2=3Þ yields a similar trend of decreasing values with increasing Coulomb stress, although subject to greater variability about this trend ( Figure 6b). These piecewise constant estimates for the variation of β or ζ values with Coulomb stress depend in detail on the choice of sample size, k. Larger k values reduce uncertainties in the estimated β and ζ values but lower their resolution of any stress dependency. Likewise, smaller k values increase stress resolution at the expense of precision. Nonetheless, similar results were obtained over a wide range of k values indicating an apparent general tendency for β and ζ values to decrease with increasing Coulomb stress under the poroelastic thin-sheet model. Once more, there is evidence for mode switching or mixing under increased stress.

Model Specifications
We will now specify four distinct and physically plausible model classes to describe the possible functional forms of any magnitude stress dependence: stress-invariant magnitudes, stress-dependent β values, stressdependent ζ values, and stress-dependent β and ζ values.

Stress-Invariant Distributions
This class of stress-invariant models has up to 2 degrees of freedom, {β, ζ} where the log-likelihood function 14 takes the special case where Also of interest are two special cases with single degrees of freedom. The first case is an unknown invariant power law with zero taper, specified as The second case is an unknown invariant taper with a known universal power law, such that These basic invariant magnitude-frequency models are all unable to explain the significant difference observed between the low-and high-stress partitions of the Groningen earthquake catalog (Figure 3).

Journal of Geophysical Research: Solid Earth
Nonetheless, they provide useful performance references for the following two alternative classes of stress-dependent models.

Stress-Dependent β Values
Within this class of models we represent the stress dependence of the frequency-magnitude distribution according to 11 given ζ i ¼ 0 and where ΔC i is the maximum incremental Coulomb stress state at the occurrence time, t i , and epicentral location, As a first possible parameterization of f(ΔC i ), we will consider an inverse power law of the following form: To avoid implausibly large β values, we include the constraint β i ¼ min ðβ i ; 1Þ . This model has 4 degrees of freedom {θ 0 , θ 1 , θ 2 , θ 3 }, where θ 1 , θ 2 , θ 3 are nonnegative. In general, β values decrease with increasing Coulomb stress to the lower bound θ 0 . This model has an asymptote at ΔC i ¼ θ 1 and so its range of physical validity is restricted to ΔC i > θ 1 . The scale and shape of the stress dependence are given by θ 2 and θ 3 , respectively, which are both restricted to be nonnegative. For θ 3 ¼ 0 the model is stress invariant, and for θ 3 ¼ 1 the model is a linear function of ΔC i equivalent to the theoretical model proposed by Scholz (1968).
We will also consider an alternative parameterisation of f(ΔC i ) to represent a smooth step-like transition from an upper to a lower bound with increasing stress without increasing the degrees of freedom. This is motivated by Figure 6a and previous observations of apparent mode switching in volcanic seismicity (Roberts et al., 2016).
In this case, the smallest and largest possible β values are bounded such that β min ¼ θ 0 , and the largest possible decrease in the β value with increasing stress is β max − β min ¼ 1 2 θ 1 . The shape and location of this smooth step down in β values are governed by θ 2 and θ 3 , respectively. The observable performance of these two stress-dependent β value models is not greatly sensitive to these alternative parameterization choices as they both represent a smooth nonlinear approach to a lower bound. These models do differ in their extrapolation to earlier times with lower stresses as only the second model has an upper bound. However, under extrapolation to later times with higher stresses the two models become equivalent as they approach a common lower bound. For seismic hazard and risk analysis we only require this second type of extrapolation.

Stress-Dependent ζ Values
Within this alternative class of stress-dependent models we represent the stress dependence of the frequency-magnitude distribution according to 11 given β i ¼ β and ζ i ¼ f ðΔC i Þ. As a first parameterization, we model the stress dependence of ζ according to a critical point power law scaling motivated by statistical fracture mechanics (e.g., Alava et al., 2006), such that

10.1029/2020JB020013
Journal of Geophysical Research: Solid Earth where θ 3 is the critical stress of the system corresponding to the divergence of failure correlation length scales and the onset of global failure. The θ 2 is the nonnegative critical exponent of this power law, and θ 1 is a proportionality constant. So, as ΔC → θ 3 , then ζ → 0. This means seismic moments initiated under critical stress states are distributed as a power law, whereas subcritical stress states involve power laws with an exponential taper. Within this model, the power law exponent, β, is a constant while the strength of the exponential taper decreases as stress states approach the critical point, as previously argued by Main (1995Main ( , 1996. Given this parameterization choice, θ 1 ¼ 0 corresponds to the power law distribution without any exponential taper, and θ 2 ¼ 0 corresponds to an exponential taper independent of the stress state. This model has 4 degrees of freedom {θ 0 , θ 1 , θ 2 , θ 3 }. The joint posterior distribution of these parameters given the Groningen events and stress model exhibit a trade-off between parameters. This may be avoided by fixing θ 1 to its maximum a posterior probability (MAP) value, but doing so may also inadvertently bias the model.
Motivated by these findings, we consider an alternative positive definite parameterization of the ζ stress function with just 2 degrees of freedom, which still allows for rapid decrease of ζ values with increasing stress toward the critical point (ζ ¼ 0) in the form of an exponential trend: This alternative model has 3 degrees of freedom {θ 0 , θ 1 , θ 2 }. With this parameterization choice, θ 1 ¼ 0 corresponds to a pure critical-state power law with no exponential taper for all stress states as also postulated in Main (1995Main ( , 1996.
Then for θ 1 > 0, and θ 2 ¼ 0, the exponential taper is present but independent of the stress state. If both parameters are nonzero, then the exponential taper depends on the stress state, and for θ 2 > 0 it follows that ζ → 0 as ΔC → ∞. So we see that this reduced parameterization is equivalent to the previous power law choice in the limit that the critical stress point is much larger than the presently observed stress states. Although Taylor expansion of the power law 21 under these conditions leads to a linear trend, that is, ζ i ¼ θ 1 þ θ 2 Δ C i , this is not guaranteed to be positive definite without an additional constraint that creates a discontinuity in the first derivative leading to increased instability during inference. This linear form also lacks the requirement for nonlinear growth in ζ with increasing subcritical stress states. For these reasons we do not include an explicit linear parameterization for the stress-dependence of ζ values.

Stress-Dependent β-ζ Values
Within this hybrid class of models we consider a five-parameter combination of the hyperbolic tangent stress-dependent β model and the exponential stress-dependent ζ model defined here as Joint inference of the model parameters {θ 0 , θ 1 , θ 2 , θ 3 , θ 4 } in principle allows for competition between the two paradigms of stress-dependent β with ζ ¼ 0 and stress-dependent ζ with some universal fixed β. In practice, the limited number of observed events, the uncertainties in their magnitudes and reservoir stress states, and limited sampling of higher stress states and event magnitudes may critically limit the statistical power of this most complex model.

Model Inference
Adopting the established methods of Bayesian inference, we will estimate the set of parameters, Θ i , for each of the specified models, M i . Although models and magnitudes are both denoted by the same symbol M, they 10.1029/2020JB020013 Journal of Geophysical Research: Solid Earth may be distinguished as models are always associated with an integer subscript, M i , whereas any magnitude subscripts are restricted to M c and M t representing the completeness and threshold magnitudes respectively (Table 1), given the observed earthquake data set, D. According to Bayes theorem, where PrðΘ i jD; M i Þ≡PðΘ i Þ is the posterior probability distribution of the model parameters, PrðDjΘ i ; M i Þ≡ LðΘ i Þ is the likelihood distribution, PrðΘ i jM i Þ≡πðΘ i Þ is the prior distribution of parameter values, and PrðDjM i Þ≡Z i is the normalization factor or Bayesian evidence. As Z i is independent of Θ i , it may be ignored for the purposes of model inference. Using standard MCMC methods provided by the Python library PyMC3 (Salvatier et al., 2015), we sample each model's parameter space distributed according to its unnormalized posterior distribution using equilibrium Markov chains. This sampled posterior distribution constitutes a complete joint inference of all parameter values and may be marginalized over each parameter to yield individual parameter value estimates.
Relative to earlier studies (Bourne & Oates, 2017b;Bourne et al., 2018), our MCMC sampling methods incorporate three improvements. First, the adaptive Metropolis Hastings sampler was replaced with the No-U-Turn (NUTS) sampler that provides automatic tuning of the Hamiltonian sampler and uses symbolic derivatives of the likelihood function to improve sampling efficiency and reduce correlations between successive samples. Second, single trace sampling was replaced by multiple independent trace sampling in parallel on multiple CPU and, when possible, GPU cores. Third, sample chains are initiated by random draws from the prior distribution, π(Θ i ), rather than at the parameter values that maximize the posterior distribution, P(Θ i ). This last change avoids sampling bias and assists confirmation of sample repeatability between the independent Markov chains.
In addition, the earthquake data set, D, incorporates two improvements relative to Bourne et al. (2018). First, the seismological survey, KNMI, reduced the rounding of reported earthquake magnitude values from 0.1 to Stress-dependent, critical point scaling ζ models M 9 21 Θ 9 ¼ fθ 0 ; θ 1 ; θ 2 ; θ 3 g ets0.cpz4 M 10 (21|θ 1 = 10 −4 ) Θ 10 ¼ fθ 0 ; θ 2 ; θ 3 g ets0.cpz3 Stress-dependent, exponential trend ζ models M 11 22 Note. The labels are composed by a string to represent a model type and a following digit to denote the associated degrees of freedom. For instance, ets0.htc3 denotes the MAP elastic thin-sheet model with 0 degrees of freedom combined with the hyperbolic tangent of incremental Coulomb stress model with 3 degrees of freedom to represent a stress-dependent β value. 0.01. Second, the observed time period increased by 18 months from 1 January 1995 to 1 January 2018 to 1 January 1995 to 22 May 2019 (6% increase), to incorporate another 20 M ≥ 1.5 events within the Groningen catalog (7% increase).
For model inference from these data, we aim to use uninformative uniform prior distributions that honor nonnegative conditions where applicable. The range of these distributions are sufficiently large such that further increases do not influence the posterior distributions.

Stress-Invariant Models
We trained the power law distribution with an exponential taper model with constant β and ζ values as specified by 16 with uniform prior distributions: 0.3 ≤ β ≤ 1, and 0 ≤ ζ ≤ 1. The resulting marginal and joint posterior probability density (Figure 7, Figure 8a) indicates a β value consistent with its typically observed value, β ¼ 2=3, and a nonzero ζ value consistent with the presence of an exponential taper on the power law distribution of seismic moments within the Groningen catalog. The posterior distribution obtained is characterized by the following mean values and 95% credible intervals defined by the highest posterior density (HPD): β ¼ 0:64 ð0:56 < β < 0:71Þ ζ ¼ 1:2 × 10 −3 ð3:5 × 10 −5 < ζ < 2:5 × 10 −3 Þ: This is consistent with the usually observed value of β ¼ 2=3 and the presence of an exponential taper (ζ > 0). The joint posterior probability density distribution (Figure 8a) indicates no evidence for any strong covariance between the inferred β and ζ values that would appear as a clear diagonal trend in the distribution. That is lower than average β values are equally likely to be paired with lower or higher than average ζ values and vice versa.
The posterior distribution obtained for this model, is characterized by the following mean values and 95% credible intervals (HPD).
θ 0 ¼ 0:49ð0:33 < θ 0 < 0:63Þ θ 2 ¼ 0:49ð0:28 < θ 2 < 0:63Þ θ 3 ¼ 5:77ð1:96 < θ 3 < 10:0Þ: This apparent variation with stress may be a statistical artifact of neglecting stress variations in ζ as illustrated in Figure 5. The posterior ensemble β function of incremental Coulomb stress (Figure 9a) is consistent with the previous finding of a significant difference between the frequency-magnitude distribution of events occurring under stress states below and above ΔC ¼ 0:63 . Joint optimization of this magnitude-frequency model and the poroelastic thin-sheet model with its three degrees of freedom (σ, r max , H s ) yields a similar ensemble function (Figure 9c) albeit with a broader prediction interval reflecting the additional variabilities within this ensemble stress model.

Hyperbolic Tangent β Model
The posterior distribution of parameter values for the hyperbolic tangent β model specified according to 20 was sampled subject to uniform prior distributions, 1/3 ≤ θ 0 ≤ 1, 0 ≤ θ 1 ≤ 2.5, 0 ≤ θ 2 ≤ 5, and θ 3 ¼ 0. The joint posterior probability density distribution (Figure 8c) once again indicates a bounded distribution with singular MAP values. The posterior distribution obtained for this model, is characterized by the following mean values and 95% credible intervals (HPD).

Journal of Geophysical Research: Solid Earth
Under this alternative parameterization of the stress-dependent β model, the correlation structures between the parameters do differ but lead to similar evidence of apparent stress dependence. The associated ensemble β function of stress (Figure 9b) appears broadly similar to the inverse-power law model, with the largest differences limited to the lowest stress states. We attribute this to sampling bias as the observed events are significantly more prevalent under the higher stress states leaving few observations to constrain this low-stress response. Joint optimization of this magnitude-frequency model with the thin-sheet stress model leads to similar results once more (Figure 9d), and again with increased variability associated with counting the uncertainty in our knowledge of the stress states associated with each event.

Stress-Dependent ζ Models 11.3.1. Power Law ζ Model
For the power law ζ model specified by 21, and given the constraint θ 1 ¼ 1, the posterior distribution obtained (Figure 8d) is characterized by the following mean values and 95% credible intervals (HPD).
The posterior distribution of θ 3 takes values that are mostly larger than ΔC i corresponding to ζ > 0 reflecting the presence of an exponential taper to the power law distribution of seismic moments. Furthermore, as the 95% confidence interval for θ 2 excludes θ 2 ¼ 0, there is significant evidence for ζ decreasing with increasing Coulomb stress in accord with the critical point scaling laws of statistical fracture mechanics.

Stress-Dependent β-ζ Models
Within the hybrid model that combines both β and ζ stress dependence as defined by 22,

Out-of-Sample Likelihoods
We favor out-of-sample over in-sample likelihoods as a better measure of the forecast performance required by seismic hazard and risk analysis. This overcomes the inherent risk of overfitting and comparing models with different degrees of freedom. Typical hazard and risk analysis periods for Groningen-induced seismicity are 5 to 10 years and are always beyond the current observation period (van Elk et al., 2019). This means seismicity forecasts rely on near-term extrapolations of the seismological models conditioned on a given future gas production scenario. We therefore choose to exclude all in-sample model evaluation methods, such as the Bayesian Information Criterion, as these do not properly reflect this out-of-sample forecast requirement.
The most reliable measure of forecast performance is a blind prediction made prior to the observations required to evaluate forecast performance. In our case this would mean waiting at least 5 years. To avoid such a delay, we evaluate out-of-sample model performance using the existing observations. To do this, we divide the observed earthquake data set, D, into two disjoint, time-contiguous parts, D 1 and D 2 , corresponding to a training period, T 1 , and an evaluation period, T 2 . In this study, when not specified otherwise, these periods are T 1 ¼ 1 January 1995 to 31 December 2012, and T 2 ¼ 1 January 2013 to 1 June 2019. This choice splits the data into approximately two equal parts and also ensures the evaluation period covers at least 5 years to represent the typical forecast demand for these seismological models. This is a form of cross validation where the choice of out-of-sample data is restricted to reflect the forecast requirement. This is a retrospectively blind test where the choice of the start time for the out-of-sample future events was made prior to, and independent from, the later analysis. Nonetheless, there remains a residual possibility of unconscious researchers' bias influencing our analysis. Indeed, true forecast performance typically lags behind hindcast performance within meteorological models.
To evaluate out-of-sample model performance, we first sample the posterior joint model parameter distribution, P(Θ i ), given D 1 according to 24. Then we sample the out-of-sample posterior predictive distribution of likelihood values, L(D 2 |Θ i ), for the D 2 data set given the sampled posterior distribution P(Θ i ) obtained in the first step. These results are summarized by the distribution of log-likelihood values evaluated as By this measure, every model has 0 degrees of freedom to explain the out-of-sample observations, D 2 , as it is not fitted to these data. The D 2 observations correspond to the period of interventions where gas production rates were reduced and redistributed away from the central region of greatest seismicity to test the causal relationship between gas production and induced seismicity, to evaluate the ability of seismological models to forecast the seismicity response to these control measures, and to lower the expected seismic hazard and risk.
Models with too many degrees of freedom will tend to yield posterior distributions that overfit the in-sample observations, D 1 , with highly variable parameter values. This likely increases bias and reduces precision in model-based forecasts for the out-of-sample observations, D 2 , which systematically reduces the out-ofsample log-likelihood values obtained according to Equation 33. Likewise, models with too few degrees of freedom, will likely fail to fit enough of the observed variations within D 1 and carryover this deficiency.
Limitations associated with small sample sizes may confound this evaluation due to chance effects that increase performance variability and broaden the measured out-of-sample log-likelihood distribution. This limits our ability to reliably rank the model when their log-likelihood distributions overlap.
Instead, we use these distributions to measure the probability, P ij , of one model, M i , outperforming another model, M j , according to the probability of ℓ i exceeding ℓ j : This probability P ij is estimated by the fraction of randomly sampled pairs from their respective distributions that satisfy this criterion. Posterior distribution sample sizes are made large enough to ensure sampling 10.1029/2020JB020013

Journal of Geophysical Research: Solid Earth
errors, ΔP ij are insignificant when comparing models (e.g., ΔP ij < 0.01). This was verified by increasing the sample size to demonstrate the results at this level of precision are reproduced. Accordingly, self-comparison of any model yields P ii ¼ 0:5. Figure 12a shows the out-of-sample log-likelihood distributions obtained for the three stress-invariant models. Better performance appears as larger log-likelihood values so the best and worst versions of a model are found in the upper and lower tails of these distributions, respectively. As the distributions all overlap the ranking of model performance is somewhat ambiguous. So although the best performances are associated with the upper tail of the M 3 , the upper tails of M 1 and M 2 still exceed the performance of the lower tail of M 3 . Nonetheless, the two models that allow for the presence of an exponential taper, ζ ≠ 0 (M 2 , M 3 ), are both capable of better performance than the baseline model without any exponential taper ζ=0 (M 1 ) as shown by the locations of their upper tails. Likewise, all stress-dependent models also exhibit better performance than the baseline model M 1 (Figure 12b) as their upper tails all exceed the upper tail of M 1 . However, within these models the performance gain of an exponential taper appears much less clear as the three models with the best performing upper tails include two with ζ ¼ 0 (M 5 , M 7 ), and one that combines stress-dependent β and ζ effects (M 13 ).
The complete D 2 data are dominated by the smallest magnitude events, so, for instance, half of the observed events are in the range 1.5 ≤ M ≤ 1.8 compared to the largest observed magnitude at M ¼ 3:6. Since these models are intended for probabilistic seismic hazard and risk assessment their performance in forecasting larger magnitude events must be considered. We start to do this by increasing the magnitude threshold, M t , for the events admitted into the D 2 data set to obtain the subset D 2t . Then the out-of-sample likelihood analysis is repeated using the same posterior distributions of parameter values as before, P(Θ i ), to evaluate the out-of-sample likelihood values L t (D 2t |Θ i ). The modified likelihood function, L t , is given by Equations 14 and 15 where M c ¼ M t . In this manner the models are still trained by all M ≥ 1.5 events within the training data but then evaluated only on the larger M ≥ M t events within the out-of-sample evaluation data. Figure 13 shows the likelihood distributions obtained for magnitude thresholds M t ¼ f1:75; 2:0; 2:5g. Once more, the better performing models are located within the upper tails of each distribution. Table 2 summarizes the performance of all models relative to the baseline model, M 1 , according to the P i1 metric as specified by 34. As the magnitude threshold increases, it is clear that the performance of ζ=0 models significantly decreases from a top-ranked performance for M ≥ 1.5 to a bottom-ranked performance for M ≥ 2.5. Furthermore, the only models that fail to exceed the baseline model performance (P i1 ≤ 0.5) are those with a stress-dependent β values (M 5 , M 7 , M 13 ). This indicates that β values, which decrease with increasing Coulomb stress do not describe the tail of the observed magnitude distribution as well as any of the other models which all possess stress-invariant β values. In contrast, the performance of ζ ≠ 0 models with constant β values either improve (M 2 , M 3 ) or remain stable (M 10 , M 11 ) under increasing magnitude thresholds. As expected, the presence of an exponential taper measurably improves the out-of-sample forecast performance for M t ≥ 2 events. However, within this analysis, there is no evidence for stress-dependent ζ values as stress-invariant ζ ≠ 0 models perform marginally better against the baseline model for M t ≥ 2.

Simulated Seismic Moments and Magnitudes
Simulation of event catalogs using the different magnitude-frequency models allows their performance to be evaluated regarding the time series of maximum magnitudes and total seismic moment time series. Such an evaluation differs from the previous consideration of out-of-sample likelihood given the observed Figure 13. Out-of-sample forecast performance of each magnitude-frequency model (Table 1)   Note. Magnitude models were trained using the MAP stress model and the observed January 2012 to June 2019 M ≥ 1.5 events and evaluated using the observed January 2012 to June 2019 M ≥ M t events, where M t ¼ f1:5; 1:75; 2:0; 2:5g. Colors vary from red to yellow to green denoting probabilities from 0 to 0.5 to 1, respectively. magnitudes by testing the simulation results and placing greater emphasis on forecasting the larger magnitudes that most influence seismic hazard and risk analysis. The time series of total seismic moments represents the cumulative sum of seismic moments for all prior M ≥ 1.5 events. Likewise, the time series of maximum magnitudes represents the largest magnitude observed so far. These simulated time series depend on both the simulated number and magnitude of events. Over the typical range of β values associated with this seismicity most of the total seismic moment is contributed by the maximum magnitude event. As such, both time series are closely related, but with one key distinction. The total seismic moment in 1995 is unknown whereas the maximum magnitude is known to be M max ¼ 2:4 from a regional monitoring network reporting all M ≥ 2 events. Figures 14 and 15 compare the observed and simulated time series of maximum magnitudes and total seismic moments for two stress-invariant and three stress-dependent magnitude-frequency models. All these results share the same event occurrence simulations based on the posterior distribution of Extreme Threshold Failure models (Bourne & Oates, 2017b). The posterior distributions of all models were obtained using just D 1 so the out-of-sample observations in this case occur prior to 1 January 1995 and from 1 January 2013. The simulations were run from 1965 to 2019 using the reservoir pore pressure model.
The model of stress-invariant β values given ζ = 0 (M 1 , uni0.b1) systematically overpredicts both time series for all observed events and exceeds the 95% prediction interval for maximum magnitudes. Maximum magnitude time series residuals (Figure 16) indicate the absolute mean residual (ΔM max = −0.5) is significantly larger than the expected magnitude measurement error (±0.1-0.2). The upper bound of the 95% prediction interval is always about 2 magnitude units above the observed maximum magnitude. A similar overprediction bias is seen in the total seismic moments time series where the median time series always exceeds the observed total seismic moment after 1996. Including a stress-invariant taper of the frequency-magnitude distribution (M 3 , uni0.b1z1) significantly reduces the simulation bias while also significantly increasing its precision. This is shown by the reduced width of the 95% prediction interval that still contains all the variability in the observed total seismic moment time series although the first half of the time series (1995)(1996)(1997)(1998)(1999)(2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007) is systematically overpredicted. This early overprediction bias is also seen in the maximum magnitudes time series and even exceeds the 95% prediction interval. The appearance of increasing precision with time does not reflect the increase in observational constraints but instead appears due to the influence of the stress-invariant exponential taper that starts to significantly lower the probability of M ≥ 3.5 events.
The inverse power law model for stress-dependent β values with ζ=0 (M 5 , ets0.ipb3), exhibits the same systematic tendency for overprediction of both maximum magnitudes and total seismic moments albeit to a lesser extent and without exceeding the 95% confidence interval. The absolute mean maximum magnitude bias (ΔM max ¼ 0:4) is still significant relative to the magnitude measurement errors. In contrast, the exponential stress-dependent ζ model with β ¼ 2=3 (M 11 , ets0.etz3) exhibits no bias in either maximum magnitudes or total seismic moments and does not exceed the 95% prediction interval despite this interval being significantly smaller than the previous two models. Moreover, the observed variability approaches both the upper and lower bounds of the simulated variability. As such this model demonstrates zero bias and a simulated variability consistent with the observed variability. Figure 17 shows the distributions of out-of-sample likelihood, L s (D 2t |Θ i ), for observed maximum magnitudes, D max, 2 , given the simulated maximum magnitude time series. We obtained the maximum magnitudes data set, D max by selecting the subset of events within the complete data set, D, that are larger than all previous events. This data set is then partitioned according to event origin times to yield the out-of-sample maximum magnitudes data set D max, 2 . We estimate the out-of-sample likelihoods, L s , using the distribution of simulated maximum magnitude time series according to the posterior model distribution, P(Θ i |D 1 ). We performed these simulations in a nested manner to yield simulated maximum magnitude time series, S ijk , where i denotes the model, j denotes a single random sample from the posterior distribution, P(Θ i |D 1 ), and k denotes the simulation index. In this manner, time series S ijk represents the kth simulation of the jth posterior sample from the ith model.
For each i and j, and observed event within D max, 2 , we select the set of simulated maximum magnitudes at the observed origin times. Using a Gaussian kernel density estimate for the probability density function of these simulated maximum magnitudes, we compute the likelihood of this observed maximum magnitude.
Repeating this for all D max, 2 events L s is estimated as the product of these single-event likelihood values.
Repeating this for all values of j yields the posterior distribution of likelihood value for model M i . Repeating all of these steps for all values of i results in the collection of likelihood distributions shown by Figure 17. Better performing models appear with likelihood distributions located to the right of poorer performing models. However, as all these distributions overlap with different tail shapes the relative performance ranking is not completely clear. Nonetheless it is clear the top two models both include stress-dependent ζ values.
As these distributions substantially overlap, we summarize the overall relative performance according to the probability of one model yielding a better out-of-sample likelihood than another model, P ij , according to 34. Table 3 shows the pairwise probabilities P ij that model M i outperforms model M j given their respective outof-sample maximum magnitude likelihood distributions. If the models are listed in rank order of performance then this matrix, P ij would show monotonically increasing values as i increases (moving top to bottom in each column of Table 3) and as j decreases (moving right to left in each row of Table 3). In this  case, the results are less clear in the sense that no such complete and unambiguous ranking exists. This is because the relative performance of several middle-ranking models are so similar, likely due to the small out-of-sample size available. Nonetheless, we may more confidently identify the best performing models.
The baseline model (M 1 ) is outperformed by all other models, although none exceed 95% probability, although two models (M 11 and M 13 ) are close with a 94% chance of exceeding the baseline performance. The only common feature of these two models is a stress-dependent ζ variation. One stress-dependent β model (M 7 ) does indicate a 91% chance of exceeding the baseline performance, but its chances of outperforming the leading two models are just 25% and 19%, respectively. Ranking all models by increasing performance based on this metric yields {M 1 , M 2 , M 3 , M 5 , M 10 , M 7 , M 11 , M 13 }. This is essentially the numerical model sequence shown in Table 3 except the best performing stress-dependent β model, M 7 , changes places with the worst performing stress-dependent ζ model, M 10 . We attribute this to a poor parametrization choice for M 10 resulting in a large trade-off between the θ 2 and θ 3 parameters shown in Figure 8d associated with insufficient information to constrain the location of the critical point, θ 3 .

Probabilities of Larger Magnitudes
Central to the analysis of seismic hazard and risk induced by Groningen gas production is the reliable forecasting of larger than previously experienced magnitudes. Previous probabilistic analyses for Groningen show magnitudes in the range 4.5-5.5 make the largest contribution to seismic hazard and risk metrics associated with public safety within this built environment (Bourne et al., 2015;van Elk et al., 2019). Smaller magnitudes are too small and larger magnitudes are too infrequent to influence the hazard and risk metrics.
Using the gas production history and a single future gas production scenario (2019 GTS Raming), we simulated earthquake catalogs for the entire history of gas production , and the next 5 years of future gas production (2019-2024). Earthquake occurrence was simulated using the posterior distribution of Extreme Threshold Failure models (Bourne & Oates, 2017b)   Note. Colors vary from red to yellow to green denoting probabilities from 0 to 0.5 to 1, respectively. By definition, P ij þ P ji ¼ 1, so the above diagonal cells contain the same information as their below diagonal counterparts.
magnitudes are an important model performance metric as they typically form part of traffic light systems used to trigger interventions as is the case for Groningen-induced seismicity.
Simulation-based forecasts for the next 5 years yield different distributions for the maximum magnitude event all with the same mode M ¼ 3:2 and similar medians (Table 4, 50%, M = 3.2-3.5). However, hazard and risk are driven by larger, less likely magnitudes in the upper tail of these distributions and differences between these upper tails are considerable. Table 4 shows models with an exponential taper (M 2 , M 3 , M 10 , M 11 , and M 13 ) exhibit much lower magnitudes with a 1% chance of exceeding (3.9-4.5) than the other models (M 1 , M 5 , M 7 , and M 11 ) that lack an exponential taper (5.3-5.5). This is a difference of 0.8-1.6 in magnitude. Forecasts over longer periods necessarily face increasing uncertainties associated with larger extrapolations of the pore pressure depletion model given the observed depletion history, and also larger extrapolations of the seismological model given the observed seismicity history. Limiting forecast periods to 5 years or less limits exposure to extrapolation related uncertainties.

Including an Upper Bound
The magnitudes models, as formulated so far, do not include an upper bound, corresponding to a maximum possible magnitude, M max . For  Note. These results are based on the 2019 GTS Raming production scenario.
any finite system there must be a finite limit on the magnitude of earthquakes within that system. This quantity is not directly observable within the Groningen earthquake catalog, but we are still able to incorporate a prior distribution for M max within the magnitude-frequency models. Following Cornell and Van Marke (1969), the survival function of a magnitude distribution may be truncated to reflect some prior belief in a maximum possible seismic moment, M max , providing an upper bound to the distribution, according to where the untruncated survival function PðMÞ is given by 11. In the case of Groningen-induced seismicity,  reported a prior discrete distribution of maximum possible magnitudes based on collective expert judgments with a 3.75-7.25 range and a 4.8 median and a 5.0 mean. For the most part, the influence of the posterior distribution of exponential tapers to the frequency-magnitude distribution occurs at significantly lower magnitudes than this prior distribution of M max values.
For the data analyzed in this study, incorporating stress-dependent exponential tapering of the power law seismic moment distribution alongside an upper bound in earthquake magnitude-frequency models used for probabilistic hazard and risk analysis of induced seismicity within the Groningen gas field reduces bias that may otherwise in this case overstate the hazard and risk. Utilizing data-driven, stress-dependent ζ models also reduces the impact of imposing a maximum possible magnitude based on an expert judgment. This procedure is therefore more robust to possibility of expert bias. Figure 19 illustrates the influence of including the possibility of a stress-dependent ζ model for induced earthquake magnitudes on probabilistic seismic hazard analysis for the Groningen field. Figure 19a shows the seismic hazard map obtained using the best performing stress-dependent β values as the only model for forecasting induced earthquake magnitudes (M 5 , ets0.ipb3) consistent with the previous hazard analysis (Bourne et al., 2015;van Elk et al., , 2019. For comparison, Figure 19b, shows a probability weighted-mean seismic hazard map that includes both the best performing stress dependence β model (M 5 , ets0.ipb3) and the best performing stress-dependent ζ model (M 11 , ets0.etz3). In this case, we treat model selection as an epistemic uncertainty and include each independent model class as different branches on a logic tree of alternative Monte Carlo simulations. To do this, we use the evidence-based weight factors given by Table 3. So, the mutually independent and collectively exhaustive model set {M 1 , M 2 , M 7 , M 11 } would be represented by logic tree branch weights {0.04, 0.08, 0.18, 0.7}. Given, the small weights attached to the first two models, and the need to limit the computational cost of probabilistic seismic hazard and risk assessments we truncate this to the top two models {M 7 , M 11 } by setting the weights of the excluded models to zero and renormalizing the remaining weights to obtain the model weights {M 5 : 0.2, M 11 : 0.8}. Even in the presence of a distribution of upper bounds to the magnitude distribution that starts at M max ¼ 3:75, the implications of including an exponential taper to the pure power law is still significant. The maximum seismic hazard is reduced by 31% from 0.174g to 0.130g.

Conclusions
Induced earthquake magnitudes observed within the Groningen gas field exhibit evidence of stress dependence. The preexisting reservoir fault system is progressively reactivated under increasing incremental Coulomb stress loads due to gas production that reduces reservoir pore pressures and increases compaction strains. The timing, location, and magnitude of seismic fault slip are consistent with the reactivation of a highly disordered fault system characterized by a large, stochastic prestress distribution. If this prestress variability significantly exceeds the induced stress loads, as well as the earthquake stress transfers, then the space-time-size distribution of induced earthquakes may be described by mean field theories within statistical fracture mechanics. Under these idealized conditions, the frequency-moment distribution of induced earthquakes will deviate from a pure power law with an exponential-like taper that reduces the expected magnitude. Initial failures within such a strongly disordered fault system will be limited to locations exposed to the extreme values in the upper tail of the prestress distribution. These rare locations will then most likely be surrounded by lower prestress states that will limit the size of that failure. As gas production proceeds induced incremental stresses increase steadily and failures initiate closer to the body of the prestress distribution. This increases both the rate of induced seismicity and the expected magnitudes as both the number and connectivity of critically stressed fault segments rises in a nonlinear manner.
To test the hypothesis that Groningen-induced seismicity originates within such a strongly disordered fault system, we evaluated the out-of-sample forecast performance of a wide range of stress-invariant and stress-dependent magnitude models. These include stress-invariant power law models characterized by the power law exponent, β, and the characteristic value of an exponential taper, ζ, where the corner seismic moment of this taper is M c ¼ M m =ζ relative to the minimum moment, M m . Potential stress dependence was evaluated using models of stress-dependent β values with no taper (ζ ¼ 0), and models of stress-dependent ζ values with constant β, and also a hybrid model that combines both β and ζ stress dependence.
The stress-dependent ζ models with constant β (M 10 and M 11 ) offer higher performance magnitude-frequency forecasts than the stress-dependent β models with ζ ¼ 0 (M 5 and M 7 ) 75-85% of the time (Table 3) and lower the magnitude with a 10% and 1% chance of exceedance from 4.3 to 3.8 and from 5.5 to 4.5, respectively. Likewise, stress-dependent ζ models outperform stress-invariant ζ models (M 2 , M 3 ) about 90% of the time, although in this case the stress dependence of ζ increases the magnitude with 1% chance of exceedance from 3.9 to 4.4-4.5. The hybrid model with stress-dependent β and ζ values, M 13 , includes all these possibilities in one joint posterior distribution, resulting in a 1% magnitude of 4.6, which is much closer to the stress-dependent ζ models than any of the other frequency-magnitude distributions (Figures 18c and 18d).
The stress-dependent exponential taper power law model introduced here likely offers better forecast performance and better represents the physical processes of failure size distributions within a highly disordered heterogeneous geological material experiencing the onset of induced seismicity due to increasing stress loads. The limited sample size of Groningen earthquakes means we cannot be definitive in our preference for a single frequency-magnitude model. Instead, we represent our currently limited knowledge using a range of different models weighted by their measured performance evidence rather than expert judgment. Over time, further earthquake observations within Groningen, other analogue fields, or laboratory experiments may be decisive.

Appendix A: Poroelastic Thin-Sheet Model Inference
The poroelastic thin-sheet model used to describe vertically averaged, intrareservoir changes in the maximum Coulomb stress states induced by pore pressure changes (section 5) depends primarily on the observed distribution of pore pressure changes, ΔP(x, t) and reservoir strains, ϵ zz (x, t). To further refine the 10.1029/2020JB020013

Journal of Geophysical Research: Solid Earth
performance of this model, there are three degrees of freedom that allow for minor adjustments to the modeled stress field. These parameters govern the length scale of stress field smoothing, σ, the filtering of fault offset contributions to local stress perturbations, r max , and a poroelastic skeletal modulus, H s . Groningen earthquake occurrence within this induced stress field may be described a stress-dependent Poisson Point Process (Bourne & Oates, 2017b;Bourne et al., 2018). Figure A1a shows the optimization results for the thin-sheet stress model using the observed origin times, t i , and epicenters, x i , and their modeled incremental maximum Coulomb stress states, ΔC(x i , t i ), according to the thin-sheet stress Equation 4. The log-likelihood, ℓ, of n-observed induced earthquake occurrences under these stresses is modeled as an extreme threshold failure within the tail of a stochastic prestress distribution according to (Bourne & Oates, 2017b, Equation 65) where S and (t a , t b ) are the spatial domain and time interval of the observations, respectively, and the intensity function of this Poisson Point Process, λ, is defined as λðx; tÞ ¼ β 0 e β 1 ΔC ðx; tÞ : The smoothed thin-sheet incremental maximum Coulomb stress, ΔCðx; tÞ, is computed according to 9. The activity rate model parameters β 0 , β 1 describe an exponential relationship between Poisson intensity and incremental stress. 10.1029/2020JB020013 Figure 20a shows the marginal posterior distributions of these parameters after training the remaining degrees of freedom within the stress model in combination with the extreme threshold activity rate model, M ar (ets3.eta2), using the origin times and epicenters of observed earthquakes. This indicates an optimal smoothing length scale of σ = 3,000 m. This represents a balance between overfitting and underfitting of the model to the training data and is commensurate with the reservoir depth that governs the lateral resolution of reservoir strains inferred from the observed surface displacements (Smith et al., 2019). The optimal fault filtering, r max ¼ 0:4, corresponds to excluding faults with offsets that reduce reservoir juxtaposition by 40%. We interpret this to indicate faults with larger offsets exposed to reservoir compaction are more likely to undergo aseismic slip due to increased exposure to the overlying ductile Zechstein salt formation. The optimal poroelastic constant, H s ¼ 10 5 MPa, is consistent with the stress field conforming more to the reservoir strain field, ϵ zz , modulated by fault offsets rather than its depletion field, ΔP. Training the stress model with the observed earthquake magnitudes instead of earthquake occurrences yields a wider range of posterior values (Figures 20b and 20c). Optimal smoothing length scales are in the range 2-4 km consistent with the previous 3 km value. Larger offset faults are still excluded although the threshold may be larger than before. Also, larger values of the skeletal modulus H s are favored indicating a stress field conforming more to the depletion field, ΔP, modulated by fault offsets. When comparing different stress-dependent earthquake magnitude models using a common elastic thin-sheet stress model (ets0), we summarize these distributions with the values σ ¼ 4 km, r max ¼ 1:1, H s ¼ 10 6:5 MPa.

Data Availability Statement
These data may be obtained from KNMI at https://www.knmi.nl. All other data were provided by Nederlandse Aardolie Maatschappij BV and are made available in an open access data repository (Bourne & Oates, 2017a).