Is Coulomb Stress the Best Choice for Aftershock Forecasting?

The Coulomb failure stress (CFS) criterion is the most commonly used method for predicting spatial distributions of aftershocks following large earthquakes. However, large uncertainties are always associated with the calculation of Coulomb stress change. The uncertainties mainly arise due to nonunique slip inversions and unknown receiver faults; especially for the latter, results are highly dependent on the choice of the assumed receiver mechanism. Based on binary tests (aftershocks yes/no), recent studies suggest that alternative stress quantities, a distance‐slip probabilistic model as well as deep neural network (DNN) approaches, all are superior to CFS with predefined receiver mechanism. To challenge this conclusion, which might have large implications, we use 289 slip inversions from SRCMOD database to calculate more realistic CFS values for a layered half‐space and variable receiver mechanisms. We also analyze the effect of the magnitude cutoff, grid size variation, and aftershock duration to verify the use of receiver operating characteristic (ROC) analysis for the ranking of stress metrics. The observations suggest that introducing a layered half‐space does not improve the stress maps and ROC curves. However, results significantly improve for larger aftershocks and shorter time periods but without changing the ranking. We also go beyond binary testing and apply alternative statistics to test the ability to estimate aftershock numbers, which confirm that simple stress metrics perform better than the classic Coulomb failure stress calculations and are also better than the distance‐slip probabilistic model.


Introduction
Large earthquakes are almost always followed by a sequence of aftershocks in the first months to years, which might themselves be destructive as, for example, in the case of the 2011 M w 6.2 Christchurch event, which was triggered by the M w 7.1 Darfield mainshock (Stramondo et al., 2011). It is generally accepted that aftershocks result from stress changes induced by the mainshock. In particular, the Coulomb Failure Stress (CFS) is commonly used as scalar quantification of the stress state. It is defined as where is the shear stress in slip direction on the fault plane, is the normal stress (positive for compression), is the coefficient of friction, and p is the pore fluid pressure. Positive CFS changes, ΔCFS > 0, indicate areas of potential aftershock activity, while no aftershocks are expected in regions with ΔCFS < 0. Many studies have demonstrated a clear correlation of the spatial aftershock pattern with static CFS changes calculated based on slip models (Harris, 1998;King et al., 1994;Steacy, Gomberg, et al., 2005;Stein, 1999). However, the applicability is still ambiguous, especially because of the lack of observational evidence for seismic quiescence in stress shadow areas associated to ΔCFS < 0 (Felzer & Brodsky, 2005;Harris & Simpson, 2002) and the missing effect of dynamic stress triggering (Felzer & Brodsky, 2006).
In general, ΔCFS calculations rely on information that contains large uncertainties, such as nonunique inversion of slip models (Hainzl et al., 2009), secondary stress triggering (Helmstetter et al., 2005), and unknown receiver fault mechanisms. CFS calculations require a definition of the fault geometry and slip direction to calculate the Coulomb stress. This is typically done by resolving stress (1) on faults with known geometry or (2) on optimally oriented planes (OOP) having maximum total Coulomb stress. Both approaches are limited either due to poorly constrained fault geometries, ignoring blind faults that could pose significant threat, or to an unknown background stress field. To account for these problems, Steacy, Nalbant, et al. (2005) suggested to fix the strike according to the regional fault trends and vary dip and rake to maximize the total stress tensor, while Hainzl et al. (2010) proposed to use a distribution of receiver fault orientation to estimate the ΔCFS net effect. As a result, stress shadows are less pervasive in agreement with observations. Statistical space-time seismicity models have also been developed based on ΔCFS calculations (so-called hybrid models), for example, Bach and Hainzl (2012) implemented ΔCFS maps as spatial kernel in the epidemic-type aftershock sequence (ETAS) model, Steacy et al. (2013) combined the spatial constraints from Coulomb stress with the short-term earthquake probability (STEP) model (Gerstenberger et al., 2005), and Cattania et al. (2014Cattania et al. ( , 2015 implemented ΔCFS in the rate-state model of Dieterich (1994) accounting for uncertainties. Those models are shown to be successful in explaining the observed trends in seismicity (Cattania et al., 2018).
Recently, the use of CFS has been questioned in general by the studies of Meade et al. (2017) and DeVries et al. (2018) showing that CFS is clearly outperformed by alternative stress metrics and deep neural network (DNN) techniques. They used receiver operating characteristic (ROC) analysis to assess the ability to forecast aftershock areas. The results suggest that alternative stress metrics such as maximum shear and von-Mises stress are more accurate and reliable than CFS. Mignan and Broccardo (2019) recently questioned the results of the DNN approach and stress metrics, proposing a distance-based approach, which is simpler and superior. In this study, we challenge the previous results by repeating the analysis with more appropriate stress calculations as well as alternative tests. In particular, we use all available slip inversions from SRCMOD database to calculate more realistic CFS values assuming layered-half spaces and variable receiver mechanisms. Furthermore, we explore the previously ignored effect of different magnitude cutoffs, grid size variations, and aftershock durations to verify the use of ROC analysis for ranking the stress metrics. We also perform a sensitivity test for aftershock locations using Monte Carlo simulations of catalogs with spatial uncertainties drawn form a Gaussian distribution around the real hypocenter. Because of limitations of the binary forecasts and the ROC analysis, we finally perform additional tests of forecasts of aftershock numbers based on the different stress metrics.

Data
We use finite-fault rupture models from the SRCMOD (https://equake&hyphen;rc.info/SRCMOD/) database by Mai and Thingbaijam (2014). As of 27 November 2019, the database consist of 406 models from 188 earthquakes. However, in this study, we use only a subset of 289 models related to 130 distinct earthquakes for which reviewed aftershock data were available. The slip models are based on single or joint inversion of seismic, geodetic, and other available data.
We use the International Seismological Center (ISC) catalog and select all events occurred within 1 year and within 100 km horizontal distance to the mainshock fault, with a depth range from 0 to 50 km. The catalog is obtained in the form of a pickle file (binary format) taken from the released data of DeVries et al. (2018). These events are called aftershocks, despite the fact that some of them were probably not be related to the mainshock. The catalog covers the period between 1 January 1964 and 30 November 2012. Out of 1,689,845 total events in the reviewed catalog, selection yields 410,064 aftershocks for the analysis.

Stress Metrics and Distance Model
The finite-fault rupture models are used to calculate stress changes in a region up to 100 km away from the rupture plane and from 0 to 50 km depth. We use the PSGRN + PSCMP tool by Wang et al. (2006) to calculate the stress tensor in a 5 × 5 × 5 km gridded volume. Given the stress tensor, we calculate five different scalar quantifications. Additionally, we also use the distance-slip probabilistic model suggested by Mignan and Broccardo (2019) comparing the performance of stress based metrics: 1. ΔCFS on master fault orientation (MAS), where ΔCFS is calculated for a receiver mechanism identical to mainshock mechanism. 2. ΔCFS on optimally oriented planes (OOP), assuming a background stress field with principal stress components 1 = 1, 2 = 0, and 3 = −2 MPa, that is, a differential stress of 3 MPa, which is in agreement to the average stress drop of interplate earthquakes (Allmann & Shearer, 2009). The orientation of the principle components is in a way that the stress field is optimally oriented for the mainshock rupture.
where H is a Heaviside function and N p = 1,500 is the number of random planes which are selected from a Gaussian distribution centered around the mainshock mechanism, with an assumed standard deviation of 30 • for strike, dip, and rake. 4. Maximum Shear (MS) where is the stress tensor and 1 and 3 are corresponding eigenvalues of the deviatoric stress tensor. 5. von-Mises stress (VMS) where I 1 and I 2 are first and second invariant of deviatoric stress tensor. VMS is a scaled version of second invariant of the deviatoric stress change tensor (DeVries et al., 2018). If VMS of a material under a load is equal or greater to the yield limit, then the material will yield. 6. Distance-slip probabilistic model (R), which was introduced by Mignan and Broccardo (2019) by a logistic regression based on average slip, d, and the minimum distance, r, between the fault and grid node. The probability Pr(r, d) of earthquake occurrence in each grid is given by with parameters 0 = 10.18 ± 0.07, 1 = −2.32 ± 0.02, and 2 = 1.16 ± 0.01, which were obtained by a fit to 75% of the data set.
In general, the stress metrics are calculated for a layered half-space, where the layering is based on the CRUST 2.0 (Bassin, 2000) velocity model. However, as a reference model, we use the ΔCFS values calculated for a homogeneous half-space (MAS 0 ) with assumed Lamé's parameter = = 30 Gpa. For the Coulomb stress calculations, we use the constant apparent friction model (Cocco & Rice, 2002), according to which the Coulomb-stress changes can be written as ΔCFS = Δ −̃Δ with the effective friction coefficient = (1 − B) , where B is the Skempton's coefficient. In our study, we use a value of̃= 0.4 (King et al., 1994).

Methods
To evaluate the forecasting capability, we use the same binary classification method, that is, receiver operating characteristic (ROC) analysis, which has been introduced in previous studies to rank the performance of metrics ( The cutoff thresholds are defined by stress change values, where stress values are first arranged in ascending order before each stress value is used as a cutoff to calculate TPR and FPR. A test with random classification of binary data has equal rates of true-positive and false-positive classification. For such a case, the area under the curve (AUC) value of a ROC curve is equal to 0.5. A model that has AUC > 0.5 is better than a random classifier. A model performing no better than a random classifier (AUC < 0.5) can be rejected. Therefore, we use the ROC analysis to quantify the accuracy of our metrics for classifying areas with or without aftershocks. The ROC results can be biased by the inhomogeneity of the earthquake catalog due to varying completeness over time and space, associated location uncertainties, and the occurrence of background activity, which is not related to the mainshock. To test the potential effect of these issues, we also calculated AUC for different aftershock durations (aftershocks up to 1 year excluding first 24 hours, aftershocks in the first 3, 9, and 12 months after the main shock). Furthermore, we explore the potential dependence of the results on the grid size (2.5, 5, and 10 km). We also perform a sensitivity test concerning the effect of location uncertainties by repeating the analysis for 25 randomized catalogs, where the original earthquake location is perturbed in each direction by a Gaussian distribution with standard deviation of 3 km.
Parsons (2020) recently discussed the fact that the imbalance of typical aftershock distributions with most areas lacking events inhibits resolving power of the ROC analysis. The binary ROC test generally suffers from the fact that the test is dependent on the magnitude cutoff. This can be easily seen by considering the end-member case of an earthquake catalog, which is complete to very low magnitudes. In this case, earthquakes would likely be recorded in all subvolumes due to ongoing background activity which make the ROC results insensitive to the tested metric. The same holds for the opposite case that the threshold is too high and no aftershock is found. For intermediate cases, the test has some statistical power which depends on the cutoff value. For illustration of the dependence, we present in section 5 the ROC results for different magnitude cutoffs in the case of the 1999 Chi-Chi sequence. Later on we also present the results of a systematic test of the dependence of the AUC values on aftershock duration and magnitude cutoff for all slip models to find the best possible combination of both choices for forecasting aftershocks. The four chosen combinations are (1) aftershocks within the first year and a cutoff magnitude M m − 3 with M m being the mainshock magnitude, (2) aftershocks within the first 3 months without cutoff magnitude, (3) aftershocks within the first 3 months and cutoff magnitude M m − 3, and (4) all aftershocks within 1 year without any cutoff magnitude.
Because of the general flaws of the ROC analysis, we additionally perform tests of the number forecasts as used in CSEP tests for seismicity models (Jordan, 2006). In this way, we test both the ability to forecast the activated area and the strength of the activation, which is important for real hazard assessment. For this, we transform the spatial stress values into a spatial probability map of aftershock occurrences, which is based on the assumption of a linear dependence of the triggering potential on the stress change (if positive). In particular, the number of events in each grid cell ( n ) is assumed to be proportional to the positive stress metric change (S n > 0) where H is the Heaviside function. The normalization factor c is determined by the condition that summing over the whole region results in the total number of observed aftershocks where N obs is the total number of observed aftershocks within 1 year after the mainshock and N g is the number of grids cells in the region. Equations (8) and (9) can be used to determine the number of events in each grid cell. The likelihood of the observation in each cell is calculated by assuming a Poissonian process with an average rate n . However, this approach would lead to an immediate falsification of a model, if the model predicts 0 activity in a grid cell, where one or more events occurred. While this is theoretically correct, real aftershock catalogs are potentially contaminated by background activity and incorrect locations. To minimize this problem, we distribute a fraction f of aftershocks homogeneously in space, that is, Here we choose a fraction (f = 0.01) which is uniformly distributed. The probability that N n events will occur in a given time period and in the nth grid with predicted rate n is described by to the Poisson model with and the joint log-likelihood for all grid cells becomes The joint log-likelihood has a negative value, and the values closer to zero indicate that forecasts are close to the observations.

Results
The stress tensors generated from the PSGRN + PSCMP (Wang et al., 2006) program are used to calculate the stress metrics described in section 3. While our quantitative analysis is always done for the calculated stress in the whole gridded volume extending ±100 km horizontally and from 0 to 50 km in depth, we first illustrate the calculations by selecting a specific depth level for one specific case, namely, the slip distribution of 1999 Chi-Chi earthquake derived by Ma et al. (2001). In Figure 2, we compare the stress maps generated by the different scalar stress calculations. Classical Coulomb stress calculations have positive and negative values which mark the regions with and without possible aftershocks, respectively. Correspondingly, the values of MAS 0 , MAS, and OOP are negative and positive. In contrast, VM, MS, and VMS are only positive. In order to compare the results on a same scale we use a Sigmoid filter sig(10S n − 1) (where sig(x) = 1 1 + e −x ). The stress maps are computed at the depth of the mainshock hypocenter (7.5 km), where the epicenter of the mainshock is marked by the yellow star. The stress maps are compared to areas (black squares) where 10.1029/2020JB019553 aftershocks occurred within 1 year after the mainshock in the depth interval of 7 ± 2.5 km. ΔCFS for MAS 0 and MAS (Figures 2a and 2b, respectively) show very little to no difference in their maps. However, the OOP-type Coulomb stress map (Figure 2c) is significantly different from the former as verified in previous studies. There are few regions with high sigmoid values and no aftershocks and few aftershocks occurring in the stress shadows. Figure 2c, related to ΔCFS calculated on distributed planes, shows maximum sigmoid values in the near-fault region with decreasing values in the far field. Figures 2e and 2f, which are related to maximum stress and von-Mises stress, indicate increased stress values in the near as well as the far field.
For the same example, we also performed a detailed analysis of the ROC test. The availability of a large number of aftershocks in the catalog of the Chi-Chi event makes it a suitable case for testing. The catalog is downloaded from ISC and contains 41,351 events. Unlike the catalog mentioned in section 3 this is an updated catalog with significantly lower magnitude cutoff. Figure 3a shows the aftershocks in a volume of 100 × 100 × 50 km plotted as magnitude versus time. We use the frequency magnitude distribution (Figure 3b) to estimate the magnitude of completeness (M c ≈ 2.2). The ROC curves using all aftershocks and stress metrics in the gridded 100 × 100 × 50 km region are plotted in Figure 3c. The analysis reveals that the best performing metrics are maximum shear (AUC = 0.744) and von-Mises stress (AUC = 0.749). The AUC value for the VM model with ΔCFS calculated on distributed planes is lower but also performs well (AUC = 0.721). For the specific case of the VM model, we additionally check the dependence of ROC curves on the magnitude cutoff (M cut ), aftershock duration, and grid size. The magnitude cutoff test (Figure 3d) was performed for cutoff magnitudes of −1 (complete catalog), 2, 2.2 (M c ), 3, 4, and 5. A clear dependence of AUC on cutoff magnitude is observed, where AUC increases with increasing M cut . This result indicates that larger aftershocks are in better agreement with the calculated stress maps.
The ROC analysis might be significantly biased by background activity, which may begin to dominate with increasing time, as well as catalog incompleteness directly after the mainshock. To explore the potential effects, we also analyzed the ROC curves for different aftershock durations. We find that ignoring the first day, where the catalog is likely incomplete, does not have a significant effect (AUC of 0.696 instead of 0.698). However, ignoring later events does have a clear impact. The results for the first 3, 6, 9, and 12 months after the mainshock show that the AUC value systematically decreases (Figure 3e) with increasing aftershock duration. The maximum AUC is observed for the aftershocks occurring within the 3 months of the mainshock. This indicates that the occurrence of background events in the later phases can significantly blur the test results. The dependence of AUC on grid size is tested by calculating the stress tensors at the centroid of cubes with edge length of 2.5, 5, and 10 km cells. Figure 3f shows the resulting ROC curves for the VM stress metric calculated on different grid sizes. A systematic dependence of AUC on grid size is observed where smaller grid sizes increase the performance of the metric, indicating that the details in the stress maps are useful for aftershock forecasting.
To analyze whether the results for the Chi-Chi event can be generalized, we generated ROC curves for all 289 slip distributions and corresponding aftershock distributions. All resulting ROC curves are plotted in Figure 4 (thin gray lines) for all stress metrics introduced in section 3, where the thick blue line refers to the average curve for each stress metric. The average is calculated by averaging true positive rates in false positive rate bins. We then determined the corresponding AUC for the average curve. We observe no clear difference between the reference metric MAS 0 and MAS calculated for regional layered crust models (Figures 3a and  3b). As expected, ΔCFS resolved on OOP results in a higher AUC (0.659) than MAS 0 and MAS (0.491 and 0.490, respectively). Furthermore, the AUC value (0.718) for ΔCFS calculated for variable mechanism (VM) is significantly higher as compared to the other CFS metrics but lower than the stress scalars MS and VMS (0.743 and 0.746, respectively). However, the maximum AUC value of 0.758 is obtained for the distance-slip probabilistic model (R) for which the figure is included in the supplementary information as Figure S4. Note that the R-model was optimized on a large fraction of the data, while the stress metrics are not optimized. Thus, this comparison might be biased, but the comparison of Mignan and Broccardo (2019) which was performed only on a subset of the test data showed the same tendency. Figure 5a shows the effect of location uncertainties of 3 km on the AUC values. We find the standard deviation of the resulting mean AUC values is small, and the ranking remains the same. Figure 5b shows a box plot to indicate the distribution of mean, median, quartiles, and extremes of AUC values for all slip models and stress scalars. The metrics OOP, VM, MS, and VMS and distance-slip model (R) have their mean, median, and first quartile all above the AUC threshold (0.5), while the mean and median for MAS 0 and MAS do not even cross this threshold. R performs best in terms of mean, median, and quartile range, but the distribution strongly overlaps with those of VMS, MS, and VM as the next best performing metrics. To test the robustness of the ranking, we repeated the calculation for different combinations of M cut and the aftershock period (see section 4). Figure 5c shows the results of the test, which confirms the robustness of the ranking. However, the AUC scores systematically vary for all metrics and for the distance-slip model depending  on the different settings. In particular, the best performing scenario (Figure 5c green dots) is obtained for aftershocks within the first 3 months after the mainshock and M cut = M m − 3. Now, we go beyond binary testing and use a statistical test to estimate the aftershock numbers because real aftershock forecasts rely on the event density as discussed in section 4. The result of the log-likelihood (LL) test is shown in Figure 6. For a better comparison of LL values for mainshock-aftershock sequences with different number of aftershocks, the resulting LL value was divided by the number of aftershocks in each case. The results are presented by box plots (whisker diagrams), which are used to study the distribution of LL values. We observe a very similar trend as in the case of the ROC analysis. However, in this case, the distance-slip probabilistic model (R) now ranks below the ΔCFS-based VM metric. Stress metrics VM perform significantly better than the conventional Coulomb stress calculations, but MS and VMS are still better. The best result in terms of number forecasts is obtained for VMS.

Discussion
Coulomb failure stress has been largely used to explain earthquake triggering and particularly to describe the locations of aftershocks by separating the region into positive and negative stress change areas. Theoretically, aftershocks are only expected in regions with increased CFS value. However, it has been already recognized that aftershocks frequently occur in regions with calculated, negative stress changes, so-called stress shadows (Felzer & Brodsky, 2005;Harris & Simpson, 2002). It has been suggested that this is related to uncertainties in the slip models (Hainzl et al., 2009;Helmstetter & Shaw, 2006;Marsan, 2006) as well as the variability of the receiver mechanisms (e.g., Hainzl et al., 2010). Taking these uncertainties into account has been previously shown to improve the forecasts significantly (Cattania et al., 2014). A study conducted by Steacy, Nalbant, et al. (2005) suggests in a poorly defined regional stress area, or in a structurally complex area, the strike of the receiver planes should be fixed to that of the mainshock plane and let the dip and rake vary to calculate the Coulomb stress.
The uncertainties and finite resolution of finite-fault models reduce the capability of Coulomb stress to reproduce the on-fault aftershocks (Steacy et al., 2004). With majority of aftershocks occurring in the proximity of the mainshock rupture plane (Felzer & Brodsky, 2006;Gu et al., 2013;Moradpour et al., 2014), the Coulomb stress metric is expected to suffer more strongly than the simple metrics from those limitations. In order to test whether this explains our results, we repeated the ROC analysis by eliminating near-fault grid cells as well as their respective aftershocks (see Figure S3). While MAS 0 and MAS do not significantly improve, the performance of the OOP metric deteriorates when excluding near-fault aftershocks. While this is counterintuitive, it is likely related to the effect of background activity, which is not associated to the mainshock stress. Excluding the near-fault area also excludes the area with the highest signal-to-noise ratio.
The recent studies of Meade et al. (2017) and DeVries et al. (2018) claimed that simple stress metrics, which do not need any specification of the receiver mechanism, are superior to the Coulomb stress calculations. However, their claim can be challenged because of the unrealistic CFS calculations, which did not account for the known uncertainties in the CFS calculations, as well as the potential artifacts and shortcomings of their ROC analysis. Thus, we performed a systematic reanalysis including (i) previously introduced CFS scalars accounting for receiver fault variability (OOP and VM), (ii) improved stress calculations based on regional, layered velocity models, (iii) different time windows and magnitude cutoffs, and (iv) the LL test quantifying the forecasts of the spatial distribution of aftershock numbers. Countering the approach of DeVries et al. (2018) and Mignan and Broccardo (2019) shows that a logistic regression model using average mainshock slip and measured distance performs better than DNN approach. Hence, we consider the distance-based model as reference model and perform a detailed analysis to compare the performance of stress metrics.
Our comprehensive analysis shows that the results of OOP and VM are significantly better than the previously tested MAS value. This indicates the importance in accounting for the variability of aftershock mechanisms to get more realistic CFS predictions. Further, geological constraints can be used to narrow down the standard deviation in receiver fault distribution to obtain more realistic results. While there are improvements observed in the performance of OOP and VM, scalar stress metrics still performed significantly better than receiver dependent metrics. The underlying reason is not yet clear and needs further research. It might be that MS and VMS are more efficient in accounting for triggering mechanisms, which are not directly considered, such as afterslip, poro/visco-elastic deformations, and dynamic stress triggering. However, our tests show that distance-slip (R) is the best performing model for forecasting the aftershock area (binary forecast). This result indicates that there might not be any need to calculate stress tensors when forecasting the activation area. In contrast, the best stress metrics are found to outperform the R-model in regards to forecasting earthquake numbers, that is, the spatial density of aftershocks which is more important for seismic hazard studies. It should be noted that our analysis of alternative stress metrics is not exhaustive, and other metrics might be even better. For example, Terakawa et al. (2020) just introduced a new energetics-based stress metrics ΔETS, jointly accounting for coseismic stress changes and the background stress field. While their test for the case of the Landers aftershock sequence shows encouraging results, Mignan and Broccardo (2020) replied that it is likely not better than their distance-based approach. Whether or not ΔETS systematically improved forecasts might be tested in a future study similar to the present one.
For our analysis, we tried to use some meaningful parameters for the CFS calculations. Using more realistic friction coefficients, background stresses (OOP), or uncertainties in the receiver mechanisms (VM) might improve the results of the CFS metrics. However, such a retrospective optimization of model parameters would bias the comparison, because MS and VMS have no free parameters. Although the background stress field for the OOP model could be set according to alternate rules. For example, Mignan (2020) suggested that the prestress is released by the mainshock stress drop. So, we recalculate the OOP metric for deviatoric background stress, which is calculated for each mainshock individually to equal the estimated stress drop (see details in the supporting information Figure S5). A notable improvement is observed in the OOP results, but it does not change the ranking of stress metrics and distance-slip model.
Our test results might be distorted by background events and aftershocks triggered by aftershocks, so-called secondary triggering. Our analysis already indicates that the results change for different time windows, with best results for aftershocks in the shortest time period after the mainshock. This points to the blurring effect of background activity. Declustering may possibly remove the effect of higher order aftershocks, but no simple method exists which could be applied on our diverse data set. In general, the epidemic type aftershock sequence (ETAS) model accounts systematically for background as well as secondary aftershock triggering. Implementing a spatial kernel based on the stress metrics could be one way to do systematic tests including background and secondary aftershock triggering. However, those studies can only be performed for individual sequences with high data quality.
In our study, we do not consider the effect of dynamic stress changes, which is used to explain aftershock locations within a week of a large mainshock (Prejean et al., 2004). However, our results indicate that the stress metrics works best for the largest aftershocks triggering in the shortest time period after the mainshock. This is encouraging, because they have the largest impact on seismic hazard. A thorough analysis can be performed like in Figure 5c to specify the forecasting thresholds on the time period and cutoff magnitudes for different tectonic regions.
It is important to note that our study does not discard the use of CFS in general and Coulomb failure theory might still describe the physics for earthquake triggering. It rather indicates that CFS in the case of limited fault information is not the best choice for aftershock forecasting. However, if precise information about the receiver planes are available, CFS might still be the best choice, for example, to evaluate the trigger potential on well-known neighboring fault segments.

Conclusion
Despite its frequent application for several decades, Coulomb failure stress calculations have been questioned by recent studies and shown to be outperformed by other stress scalars and state-of-the-art methods like deep neural network in forecasting aftershocks. However, the recent results are also questionable because of an artificial DNN application (Mignan & Broccardo, 2019) as well as simplified CFS calculations. As this has broad implication for this research area, we performed a comprehensive reanalysis of the previous ROC-based study. Here we include CFS metrics accounting for the variability of aftershock mechanisms and additionally taking account of the incompleteness of catalogs as well as the occurrence of background activity. In addition to the previously conducted ROC analysis for binary forecasts, we also tested forecasts of aftershock numbers.
To summarize, we find that the results of the ROC analysis are dependent on the magnitude cutoff, aftershock duration, and grid sizes and that more realistic CFS calculations (OOP and VM) can significantly improve the results. However, our analysis verifies that the stress scalars MS and VMS, and distance-slip probabilistic model (R), all of which do not rely on any specification of receiver mechanisms, outperform on average the CFS metrics in all test setups. While CFS might still be used for the evaluation of the stress changes on well-defined fault segments, our results indicate that spatial forecasts of the aftershock density might be generally improved by using von-Mises stress (VMS) instead of Coulomb stress.