Predicting Imminence of Analog Megathrust Earthquakes With Machine Learning: Implications for Monitoring Subduction Zones

Subduction zones are monitored using space geodesy with increasing resolution, with the aim of better capturing the deformation accompanying the seismic cycle. Here, we investigate data characteristics that maximize the performance of a machine learning binary classifier predicting slip‐event imminence. We overcome the scarcity of recorded instances from real subduction zones using data from a seismotectonic analog model monitored with a spatially dense, continuously recording onshore geodetic network. We show that a 70–85 km‐wide coastal swath recording interseismic deformation gives the most important information on slip imminence. Prediction performances are mainly influenced by the alarm duration (amount of time that we consider an event as imminent), with density of stations and record length playing a secondary role. The techniques developed in this study are most likely applicable in regions of slow earthquakes, where stick‐slip‐like failures occur at time intervals of months to years.


Introduction
The preparatory phase of large subduction earthquakes can be depicted as a period of slow, continuous stress accumulation caused by the frictional interaction between converging plates (e.g., Hyndman et al., 1997). However, as geodetic (Global Navigation Satellite System [GNSS]) observation networks have matured, it has become apparent that there are significant variations of interplate locking before and/or after large earthquakes. These include transient slow slip events, indicating sub seismic-cycle scale, short (days to months) variations in the rates of stress accumulation/release (Heki & Mitsui, 2013;Loveless & Meade, 2016;Mavrommatis et al., 2014;Melnick et al., 2017). Such variations result in nonsteady (transient) surface motions measurable with space geodesy. Stress variations prior to large earthquakes may manifest as a series of foreshocks gradually unzipping the plate interface-as in the case of the 2014 Iquique M 8.1 earthquake (Schurr et al., 2014)-or as accelerating aseismic creep-as suggested for the 2011 Tohoku M 9.0 earthquake (Kato et al., 2012;Mavrommatis et al., 2014). The recognition of these pre-earthquake transients raises the potential for using them as a diagnostic tool for earthquake imminence. However, the scarcity of recorded instances hinders understanding whether and which transient signal may be used as a reliable indicator for earthquake prediction.
Machine learning (ML) represents a group of algorithms efficient in identifying indicators and not-so-obvious ("hidden") patterns in large data sets ("big data"). The possibility to train an algorithm and use it for making accurate predictions based on the "past experience" is one of the complex tasks that ML can achieve (e.g., Bergen et al., 2019). Recently, the earthquake research community has demonstrated such capability of ML to draw inferences about fault physics: The acoustic signal emitted by rock samples sheared in a direct shear apparatus has been used for predicting the onset time of laboratory earthquakes (Rouet-Leduc et al., 2017), for estimating the instantaneous fault analog friction (Rouet-Leduc et al., 2018), and for predicting earthquake slip mode . Changing scale from laboratory to nature, ML has been used to identify a tremor-like signal emitted by Cascadia's megathrust that tracks the instantaneous displacement rate measured by a GNSS station . Accordingly, we have possibly entered a new era of seismological discovery in which the full spectrum of transient signals from seismogenic faults is identified with ML and in which ML might diagnose imminence of events such as slow slip or even earthquakes. In other words, ML might be able to recognize and classify a set of processes or a characteristic pattern diagnostic of the imminence of fault failure. To gather the maximum benefit from such a novel approach, some technical points regarding the type and characteristics of data to use should be addressed in advance. For geodetic networks, the key questions would be as follows: Which region of the subduction margin is the most diagnostic? How important is the space-time data coverage? How far in advance can coseismic slip be predicted?
To address these questions, we experiment here with ML binary classifiers for predictions of earthquake imminence using GNSS-like surface deformation data from seismotectonic scale (analog) modeling . Analog modeling allows us to experimentally overcome the lack of long time series from real subduction zones, with a smaller physical model mimicking a multicentury history of seismic cycling in a few minutes in the lab (see Hubbert, 1937, for theory of scaling in analog modeling). Such analog model reproduces the basic features of earthquakes and seismic cycles, namely the elastic loading and release of a frictional fault embedded in an elastic medium. Indeed, our model is strongly simplified with respect to the natural prototype, where additional factors such as pore fluid pressure (e.g., Moreno et al., 2014;Moreno et al., 2018) or off-megathrust fault networks (e.g., Wang & Bilek, 2011) might control seismicity. Nevertheless, similar models have been shown to successfully reproduce complexity by means of the intrinsic variability of, for example, recurrence and slip patterns of natural systems to first order (e.g., Corbi et al., 2013;Corbi et al., 2017;Rosenau et al., 2010Rosenau et al., , 2019. As in Corbi et al. (2019a), we here use the surface deformation time series of such a model for ML-based analysis. ML provides a versatile tool for testing how various data characteristics influence prediction performances. With respect to a previous study by Corbi et al. (2019a), we here impose the following modifications: 1. Instead of framing the scientific problem with regression of time to failure, here we step toward binary classification of alarm state-that is, the ML predicts whether a given deformation field is characteristic of the few seconds that precede slip onset or not. Binary classification has the advantage of easy to interpret metrics for evaluating the prediction performances. 2. Instead of using an ideal, uniformly spaced GNSS network extending all the way to the trench, here we exclude data coming from above the analog offshore seismogenic zone to mimic limitations in geodetic coverage at subduction zones (i.e., we use only data from above the inland aseismic zone). Hence, we assume that the base of the seismogenic zone coincides with the coastline (Ruff & Tichelaar, 1996;Saillard et al., 2017; Figure 1a).
We show that ML can be used for identifying the most informative region of the convergent margin regarding imminent asperity failure and provides a useful indication of how many measurement points are needed on the surface as well as the optimal record length.

Data, Method, and Metrics
Data used in this study are derived from a seismotectonic analog model that represents a subduction zone characterized by two asperities of equal size and friction ( Figure 1a; Corbi et al., 2017; details on the setup are in Supporting Information S1). The model produces analog earthquakes equivalent to magnitude Mw 6.2-8.3 when scaled to nature, with a coefficient of variation in recurrence intervals of 0.5, similar to real subduction zones (Williams et al., 2019). Here we use the data of Corbi et al. (2019a) that are available open access in Corbi et al. (2019b). Data consist of a 400 s recording of incremental surface displacement (trench parallel and orthogonal components; Figure 1b) capturing 40 seismic cycles. Displacement is measured with a precision of few tens of μm (i.e., few tens of m when scaled to nature) and at a resolution of few mm spacing between virtual GNSS stations (i.e.,~7 km when scaled to nature) using Particle Image Velocimetry PIV (Sveen, 2004). PIV data can be considered equivalent to a spatially dense, continuous GNSS network.
Data are organized in a matrix of predictors X (3,000 rows by 1,508 columns) where each column corresponds to "displacement measured at a synthetic GNSS station" and rows correspond to time steps (increments). We use exclusively the trench orthogonal and trench parallel components of the displacement field as input features because they have been shown to be the most informative (among surface deformation descriptors) for this model (Corbi et al., 2019a). For our target variables, we select nine target points distributed along the margin ( Figure 1a) for which we identify the slipping time of analog earthquakes (i.e., experimental time at which displacement rate exceeds 0.01 cm/s) and the slip onset t so . Then, for each event, we assign the label "alarm" at those time steps that are comprised between t so -Δt and t so (where Δt indicates alarm duration) and "no alarm" at the remaining ones ( Figure 1c). This procedure is applied at each target point, so that the output is made of nine response vectors Y. Data are split into training and testing sets. We use the supervised learning Random Undersampling RUSBoost ensemble algorithm running under Matlab (Seiffert et al., 2008), which, in the training phase, "learns" the relationship between the displacement field and the alarm state of each target point. The algorithm is then fed with testing data and predicts if, at a specified time, a portion of the margin is in alarm given the current displacement field ( Figure S1).
RUSBoost selects (subsamples) a random fraction of the most represented class (no alarm) in order to have a balanced data set. This allows RUSBoost to be particularly effective in classifying an imbalanced data set as

10.1029/2019GL086615
Geophysical Research Letters in our case, where alarms represent 3-27% of the observations, depending on alarm duration and selected target point. After this initial step, RUSBoost proceeds, as in the Adaptive Boosting approach (e.g., Freund & Schapire, 1997), with sequentially building an ensemble of binary decision trees where nodes are displacements measured at various points on the model surface. For each tree, the algorithm computes the weighted classification error and then increases weights for observations misclassified at a given step and reduces weights for observations correctly classified, so that the following tree is trained with updated weights. This procedure is repeated to progressively improve the classification performances (Supporting Information S2).
Binary classification algorithms produce "positive" or "negative" outcomes depending on the identified system state (i.e., alarm and no alarm). An event may occur or not (we truly have a slip event or not). Based on the correctness of the predicted outcome, four cases are possible, as summarized by the confusion matrix ( Figure S2): true positive TP, true negative TN, false positive FP, and false negative FN. Precision and recall are two basic evaluation measures for binary classifiers. Precision is defined as TP/(TP + FP) and tells us how many times the raised alarm is correct over the total times we raised an alarm. Recall is defined as TP/(TP + FN) and represents the number of analog earthquakes that have been forecasted by alarms over the total number of earthquakes (both correctly forecasted and missed). The receiver operating curve ROC and the precision-recall curve PR are two other useful indicators of model performance. The ROC is a representation of recall against the false positive rate at various cutoff values used by the algorithm to separate between alarm and no alarm and informs on how well the classifier separates the two classes.
The PR shows the trade-off between precision and recall and provides information about how effective a classifier is without raising too many false alarms. The areas under ROC and PR curves (AUC-ROC and AUC-PR) provide single measures of the classification performances ranging from 0 to 1, with 1 representing a perfect classifier. Here we report all above-mentioned metrics and highlight that, in our case, AUC-PR is more informative than the ROC, being independent from the larger fraction of no alarms of our time series (Saito & Rehmsmeier, 2015).

Sequential Features Selection to Identify the Most Informative Region of the Margin
In ML, feature selection (FS) is an important step that precedes model training, especially when a data set has a number of features larger than observations. In our case, where the features to observations ratio can be >1, FS discards less useful and redundant features to improve the model performance and avoid overfitting. FS, by reducing the number of features to interpret, also increases the interpretability of the ML results. To perform FS, we used the Matlab algorithm sequentialfs on training data. sequentialfs considers the interaction of features and selects a subset of them by sequentially adding one feature while keeping track of the misclassification rate and features index. The minimum of the misclassification error is used as stopping criterion. This way, sequentialfs identifies the most relevant features for classification. Since the features that build our X are synthetic GNSS time series of known coordinates, sequentialfs highlights the most diagnostic region of the margin. This approach implies that additional stations located outside the region highlighted by sequentialfs would be surplus to requirements for predictions. Figure 1d shows the location of the synthetic GNSS stations selected by the algorithm for each of the nine target points. Except for Target Point 1 where trench normal and trench parallel components are of equal importance (probably due to boundary effects), the trench normal component appears to be more informative than the trench parallel one. The most informative synthetic GNSS stations generally face the offshore asperity where the selected target point is located. We merged the nine groups of selected stations into a 2-D histogram counting the number of times a synthetic station is selected in a given portion of the margin. The histogram highlights a 10-12 cm-wide (or equivalently 70-85 km when scaled to nature) region parallel and adjacent to the analog coastline as the most diagnostic area to monitor for successful predictions, especially regions adjacent to the two asperities (Figure 1e).

The Role of Training Window Length, Density of Stations, and Alarm Duration
After FS, the ML algorithm is fed only with the most informative features. Our procedure then consists in building and updating a predictive model three times using a shifting training window ( Figure S3). The number of updates has a minor influence on predictions ( Figure S4). This procedure is repeated for each 10.1029/2019GL086615

Geophysical Research Letters
CORBI ET AL. target point; the prediction model is thus an aggregate composed of nine time series. We compare the predictions with observations as a function of space (i.e., at various target points) and time (Figure 2a). A red line moving up highlights an observed alarm. If the corresponding blue line moves downward simultaneously, it indicates the classifier correctly interpreted the current deformation field as leading to failure in the near future. Different scenarios appear. Looking at the first 50 s of Target Point 2, for example, we observe predicted alarms appearing with a delay with respect to observations, almost perfectly on time, or when there is no alarm at all. This variation in model performance can be quantified as follows: ML predicted correctly the 80% of alarms (i.e., precision), and 90% of no-alarm predictions were reliable (i.e., negative predictive power; Figures 2b-2h). Predictions for target points located above the barrier (i.e., Target Points 4-6) display relatively worse performances, with precision decreasing to ≈50% due to the smaller number of instances available for the training if compared to target points located above the asperities.

Geophysical Research Letters
We repeated the above procedure 80 times, exploring the role of training window length (i.e., record length), density of stations, and alarm duration, following a 3-D grid search approach (Figure 3a). In particular, we varied the training window length in the 100-200 s range, or equivalently for 2-10 analog seismic cycles depending on the training window length and target point; density of stations from 4 to 72 stations per square decimeter, or equivalently 6 to 107 stations per 100 km 2 (see Supporting Information S1 for information about scaling); and alarm duration from 1 to 5 s, or equivalently 0.05 to 0.25 the average seismic cycle duration (a rough estimation of the ratio between slip and stick phases in nature is in the order of 10 -8 for large subduction earthquakes and 10 -1 for slow earthquakes; see section 4.2). For each aggregate we report AUC-PR averaged over the nine target points as a single parameter describing the prediction performance of the aggregate. Independently from alarm duration, we generally observe that models with longer training have higher AUC-PR than models with a short training. Such improvement appears more evident for models with higher alarm durations ( Figure S5). The effect of station density in the explored range shows AUC-PR oscillating with unclear trends. In general, the effect of station density and training window length on prediction performances for a given alarm duration is smaller than 10%. The highest average AUC-PR value has been recorded for the longest record (i.e., 200 s) and a relatively sparse (i.e., six stations/dm 2 ) network. A clear increase of average AUC-PR emerges upon comparing models with increasing alarm durations (Figure 3a).

Implications for Monitoring Subduction Zones
We have used GNSS-like data from an analog model reproducing multiple subduction megathrust seismic cycles to feed a ML algorithm that predicts the imminence of a slip event. We have identified the most important features to feed into this algorithm and have tested influence of training data size, imminence duration, and measurement spatial density on the performance of the binary classifier.
From feature selection, we found that the region bringing the most important information is located along a 10-12 cm-wide (or equivalently 70-85 km when scaled to nature) swath parallel and adjacent to the coastline, where the highest displacements are measured. This finding supports a scenario where signs of imminent failure come from the downdip edges of asperities, being the closest to the coastline, in agreement with regions of highest shear stresses found in mechanical modeling studies (Bürgmann et al., 2005;Moreno et al., 2018). The trench-normal component of displacement appears more informative than the trench parallel one, likely due to its larger signal caused by trench perpendicular convergence. Unfortunately, we had no access to vertical deformation, and it would be interesting to test whether these data could further improve the predictions. It would also be beneficial to test whether the width of the informative swath scales with the width of the seismogenic zone. The recent development of seafloor geodetic observations in some margins affords us better updip resolution of interplate coupling (e.g., Yokota et al., 2016) and coseismic slip (e.g, Romano et al., 2012). Because in our approach ML primarily tracks the loading and unloading history of the forearc (Corbi et al., 2019a), we tested to what extent offshore kinematic observation would improve the prediction. To do so, we first checked which region would be highlighted by sequentialfs assuming the availability of a dense network of stations both onshore and offshore. In this case, the vast majority of informative stations would be offshore, centered above the asperities along strike and with a preferential elongation of the network along the dip ( Figure S6). Since implementing such a dense offshore network would be too expensive, we thus tested the effect of the availability of only two stations centered above the asperities or an array of nine offshore stations aligned along the margin, in addition to onshore stations. Taking an onshore network of roughly one station every 50 km as a reference, we observed an improvement (between 3% and 73% depending on alarm duration and network configuration) of the predictions performances for both tested configurations ( Figure S7). In particular, we found that for the short alarm durations, the nine stations configuration provides the best result, while upon gradually increasing the alarm duration, the two stations configuration has the best performances. These findings strengthen the idea that having access to the outer wedge deformation would be advantageous for future discoveries while also showing that the optimal configuration to use depends on the investigated target.
We have also shown that, under the studied configuration and with the adopted technique, a short duration (roughly 0.05-0.10 the duration of analog seismic cycles) prediction is unfeasible-even if a dense network 10.1029/2019GL086615

Geophysical Research Letters
with a record of up to 10 cycles is available. This is due to our framing: The large number of stations creates a wide X with many features (i.e., columns) and few observations (i.e., rows) that are controlled by number of events and alarm duration (at least in the investigated range of training window length and earthquake recurrence time). Therefore, ML has too few data for classification so that a given deformation can be interpreted either as an alarm or a no alarm. On the contrary, if we ask ML for a less focused prediction (e.g., 0.15-0.25 times the duration of analog seismic cycles), the training has a larger number of instances, because at longer alarm duration corresponds a longer X and the algorithm can better classify the deformation field. Increasing the alarm window also increases the chance of getting a slip event in the given window. Therefore, an improvement of prediction performances is expected. We show that the improvement we get is more significant than what would be expected by chance using error diagrams (Figures 3b-3f). Error diagrams plot the fraction of alarm time versus the fraction of failures to predict (i.e., 1-fraction of events correctly predicted), and they are considered a useful earthquake predictability measurement (Kagan, 2007). In these diagrams an optimal prediction would correspond to a point near the diagram origin at (0, 0), while a random forecast would fall along the diagonal connecting (0, 1) and (1, 0). We labeled a predicted event if at least one declared alarm falls within the observed alarm window. Each point on the graph represents the prediction at a given target point for different training window lengths and density of stations. Comparing various models with increasing alarm durations, we observe that the cloud of points moves progressively toward smaller values of fraction of failure to predict with only a minor shift toward large fractions of alarm time, as highlighted by the downward shift of centroids (i.e., the arithmetic mean position of all the points in the figure) of the points in error diagrams. This indicates that models with longer alarm durations are more precise while requiring almost the same number of declared alarms as models with short alarm durations ( Figure S8).
Also apparent from our results is that the availability of data with high spatial resolution is less important than having access to long time series in which the investigated phenomena repeat several times (e.g., 10 times). The observation that a better classification is achieved for models with longer alarm durations raises the additional argument of the impact of sampling rate, because sampling interval together with alarm duration contribute to the number of data with alarm labels. To test the role of sampling interval, we run seven additional models with fixed parameters (i.e., alarm duration of 5 s, training window length of 200 s, and density of stations equal to 6/dm 2 ) and varying sampling intervals in the 0.13-3.25 s range. We found that, with exception of the smallest interval, classifications become progressively more reliable reducing the acquisition interval ( Figure S9). This suggests that possibly, a better classification may be achieved also for short duration alarms by monitoring the experiment with higher frequency. This observation supports a scenario where GNSS, given the higher acquisition frequency, is more useful than InSAR when attempting to export the proof of concept tested in this study to natural subduction zones.

Unlocking the Possibility to Predict the Onset of Slow Earthquakes
Our analysis showed that, in reconstructing the spatiotemporally complex forearc loading history, ML can predict the timing and size (tracking simultaneous alarms at various target points) of analog megathrust earthquakes under given circumstances of relatively long alarms and a relatively long observation record. Laboratory models represent the ideal candidates for attempting this type of prediction given their ability to produce the necessary observational data and to repeat a given process several times (e.g., Corbi et al., 2019a;Rouet-Leduc, 2018). Here we have investigated the impact of data space-time distributions on the ability of a ML-based technique to predict the onset of analog earthquakes using geodetic-like observations. We found that, having access to the deformation history of about 10 cycles at a limited number of stations, ML can reach a precision generally larger than 0.7 (0.5 minimum value over the nine target points; Figure 2b) and with very few false alarms (false positive rates <0.1; Figure 2e). In nature we do not have access to geodetic time series including multiple large subduction earthquakes to test this method. However, so-called slow events, with slip durations of few tens of days (Obara, 2002), are potential candidates to test our approach on. Cascadia is one of those subduction zones where roughly once per year (depending on the latitude) the megathrust hosts slow slip episodes and where the geodetic record is more than 10 years long (e.g., Michel et al., 2019). Given the similarity in space-time recurrence behavior (with partial ruptures alternating with larger events) and deformation pattern (with alternating trenchward and landward surface velocities), we suggest that ML could be applied to predict the onset and the extent of slow slip in this area.

Conclusions
We investigated the role of space-time coverage and alarm duration on the performance of analog earthquake prediction. We found that alarm duration plays a primary role in tuning the performances of a binary classifier. A sharp, accurate analog earthquake prediction is unfeasible with the algorithm used in this study, even in a simplified system with a perfectly designed monitoring network. But alarm periods become in reasonably good agreement with observed earthquakes when tens of seismic cycles have been recorded and when the alarm duration is longer. These results can be further improved by tuning the network design and acquisition rates. Given these findings, we propose that it might be possible to predict the onset and extent of slow earthquakes in real subduction zones by applying similar ML approaches to those developed in this study on GNSS time series.

Data and Material Availability
All data and materials used in the analysis are available through GFZ Data Services and published open access in Corbi et al. (2019b).