Graph-Partitioning Based Convolutional Neural Network for Earthquake Detection Using a Seismic Array

We present a deep-learning approach for earthquake detection using waveforms from a seismic array consisting of multiple seismographs. Although automated, deep-learning earthquake detection techniques have recently been developed at the single-station level, they have potential difficulty in reducing false detections owing to the presence of local noise inherent to each station. Here, we propose a deep-learning-based approach to efficiently analyze the waveforms observed by a seismic array, whereby we employ convolutional neural networks in conjunction with graph partitioning to group the waveforms from seismic stations within the array. We then apply the proposed method to waveform data recorded by a dense, local seismic array in the regional seismograph network around the Tokyo metropolitan area, Japan. Our method detects more than 97% of the catalogued local seismicity, with a false positive rate of less than 4%, based on an optimal threshold value of 0.61 for the output earthquake probability. A comparison with conventional deep-learning-based detectors demonstrates that our method yields fewer false detections for a given true earthquake detection rate. Furthermore, the current method is robust to poor-quality data and/or data that are missing at several stations within the array. Numerical experiments using subsampled data demonstrate that the present method has the potential to detect earthquakes even when half of the normally available seismic data are missing. We apply the proposed method to analyze 1-h-long continuous waveforms and identify new seismic events with extremely low signal-to-noise ratios that are not listed in existing catalogs. We also show the potential portability of the proposed method by applying it to seismic array data not used for the training.

hypocenters (Kriegerowski et al., 2019; Zhang et al., 2020), and determining the association of seismic phases across a seismic array (McBrearty et al., 2019). These studies have shed light on the effectiveness of employing CNNs to analyze large seismic data sets. Many other deep-learning-based methods have also been developed for various seismic analyses (Kong et al., 2019).
Here, we present a new CNN-based method for earthquake detection using waveform data recorded by a seismic array. Earthquake detection at the single-station level is potentially unable to prevent false detections, because each station records its own local noise originating from the near-surface environment. The use of multiple seismic stations can help reduce false detections and identify very weak seismic signals (Bergen & Beroza, 2018; Li et al., 2018; Peng & Zhao, 2009). Our goal here is to develop CNNs for earthquake detection using waveforms recorded by a local seismic array.
Besides deep-learning methods, various other types of earthquake detection methods exist. Energy-based methods such as the short-term average/long-term average (STA/LTA) method have been implemented in routine processing (Allen, 1978; Withers et al., 1998), but they are limited in detecting low signal-to-noise earthquake signals and produce numerous false positives during intense seismic activity. Waveform similarity search using waveform templates, known as template matching, has demonstrated success in finding low signal-to-noise earthquake signals (Gibbons & Ringdal, 2006; Kato et al., 2016; Peng & Zhao, 2009; Ross et al., 2019; Shelly et al., 2007; Skoumal et al., 2014). Subspace detectors (Barrett & Beroza, 2014; Benz et al., 2015; Harris, 2006) generalize template matching via principal component analysis. Yet, these template-based methods are potentially limited in identifying earthquakes with hypocenters and source mechanisms that are not similar to those of the templates. The auto-correlation approach (Brown et al., 2008) searches for similar waveforms in continuous records without using templates, but its poor scaling with respect to the record length has been reported (Aguiar et al., 2017). Recently, Fingerprint and Similarity Thresholding (FAST) has been developed as a blind, waveform-similarity-based detector that does not require templates and has a computationally efficient search algorithm (Bergen & Beroza, 2018; Bergen et al., 2016; Yoon et al., 2015). Characteristically, both FAST and deep-learning algorithms extract major features from seismic waveforms. Compared with FAST, deep learning extracts features in a way that depends on the training data set and compresses them into a fixed number of parameters, implying constant memory usage. This can help reduce computational runtime and memory usage (Perol et al., 2018).
However, we need to carefully assess CNN approaches to waveforms retrieved by a local seismic array, as CNN inputs are designed to be equispaced (see Chapter 9 in Goodfellow et al., 2016), whereas the actual station distribution in an array is usually nonequispaced. A CNN is a neural network tailored for processing objects with a known grid-like topology, such as 2D images, and the operations in a CNN, such as convolution and pooling (described in Section 2), assume that the inputs have a regular grid topology. It is known that violating this assumption degrades performance, leaving room for improvement (Bruna et al., 2014; Duvenaud et al., 2015; Niepert et al., 2016). We employ a graph-partitioning method, modularity maximization, to accommodate such a nonequispaced distribution by treating the seismograph network as a graph, with each node representing a seismic station in the array. Modularity maximization is a node-clustering algorithm that automatically produces sub-graphs from a given graph. We first group the seismic stations into clusters via modularity maximization and then stack the waveforms, which are convolved with learned filters in the CNNs, within each partition. This step is regarded as average pooling with respect to the seismic stations. In conjunction with max pooling with respect to time, this step reduces the effective number of parameters in the CNNs and improves performance.
We apply our method to waveform data recorded by MeSO-net (Kano et al., 2017; Kasahara et al., 2009), with a focus on the earthquakes that occurred in the Kanto district, Japan, from September 4-16, 2011. MeSO-net is an online high-density seismic network that consists of approximately 300 accelerometers installed in and around the metropolitan area in the Kanto district (Figure 1). We adopt the 13 seismic stations that are located approximately in the center of the seismic activity around the Kanto district as the seismic array for our analysis. We compare our method with a CNN for a single station and a CNN for a seismic array consisting of multiple stations without any pooling with respect to the stations. This comparison illustrates that our method markedly improves the detection accuracy of CNNs for analyzing seismic signals, particularly those with a low signal-to-noise ratio (SNR). We also demonstrate the robustness of our method for analyzing incomplete seismic records (e.g., due to sensor problems) and temporary increases in noise levels at several stations.

Methods
An outline of our method is presented in Figure 2. The input consists of two-dimensional (2D) arrays of the three-channel waveforms (Up-Down, North-South, and East-West) recorded at the seismic stations within our local seismic array, where the horizontal axis of the 2D arrays is the elapsed time from a reference time before the P-wave onset, and the vertical axis corresponds to the station alignment. The output is the probability of the input data containing an earthquake record. Our method uses an undirected weighted graph that represents the geometry of the seismic stations within the array. The nodes of the graph represent the stations within the seismic array, and the weights of the edges represent the closeness of the stations, as discussed below. We first introduce the two building blocks (CNN and graph partitioning) of our method, following which we present the resultant architecture that is a combination of these two blocks.

Convolutional Neural Networks
CNNs are among the most powerful nonlinear feature extractors and can learn a large number of digital filters in combination with fully connected neural networks. Using these learned filters, CNNs provide a function that maps an input (e.g., three-component 2D arrays that represent the observed waveforms from a seismic array) into an output value (e.g., the probability that the input data contain an earthquake).
CNNs build upon a series of layers that sequentially process the input data as follows: (a) convolve (or, more precisely, cross correlate) the input with a set of learnable filters; (b) downsample the convolution output; and (c) apply a nonlinear transformation referred to as the "activation function" to the downsampled output. The first step (a) is termed the "convolution layer," in which the input waveform is cross correlated with each of the learnable filters. The second step (b) is the pooling layer, which downsamples the convolved waveform by taking either the maximum (max pooling) or the average (average pooling) within short segments; the input can be padded to maintain the same output dimension of the pooling operation as its input dimension. These pooling operations have different roles: Max pooling highlights the sharp changes in the convolved waveform, whereas average pooling smooths these changes. The third step (c) is the activation layer, which makes the output nonlinear and therefore enables the network to express complex features.
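The three operations above can be illustrated with a minimal NumPy sketch (an illustration of the generic building blocks, not the network used in this study):

```python
import numpy as np

def conv1d(x, f):
    """Step (a): 'valid' cross correlation of a waveform x with one filter f."""
    x, f = np.asarray(x, float), np.asarray(f, float)
    n = len(x) - len(f) + 1
    return np.array([np.dot(x[i:i + len(f)], f) for i in range(n)])

def max_pool(x, size):
    """Step (b): downsample by taking the maximum within non-overlapping segments."""
    x = np.asarray(x, float)
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

def relu(x):
    """Step (c): rectified-linear activation, max(x, 0)."""
    return np.maximum(np.asarray(x, float), 0.0)
```

Chaining them, `relu(max_pool(conv1d(x, f), 2))`, gives one convolution-pooling-activation stage of the kind described above.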
However, when applying a CNN to seismograms from a seismic array, the challenge remains of accommodating the CNN to the spatial distribution (geometry) of the stations in order to boost its power. Here, we take advantage of graph partitioning to address this challenge, as discussed below.

Graph Partitioning
Graph partitioning is a way to reduce a graph to smaller sub-graphs by partitioning a set of nodes into mutually exclusive groups using a clustering method of choice. Several methods for graph partitioning exist (Bichot & Siarry, 2013; Fortunato & Castellano, 2012). Here, we employ modularity maximization (Newman, 2006; Newman & Girvan, 2004), which finds a graph partition that maximizes the modularity score. The modularity score for a given graph partition is defined by

Q = (1/2m) ∑_{i,j} [A_{i,j} − k_i k_j/(2m)] δ(c_i, c_j),    (1)

where A = (A_{i,j}) is the adjacency matrix of the graph, m is the number of edges of the graph, k_i is the degree of node i (i.e., the number of connections that node i has to other nodes), c_i is the partition to which node i belongs, and δ(a, b) = 1 if a = b and 0 otherwise.

YANO ET AL.
10.1029/2020JB020269

It is shown that k_i k_j/2m is the expected number of edges between nodes i and j under the assumption that a graph is randomly generated according to the configuration model. Therefore, the modularity score represents the number of edges within given partitions minus the expected number of randomly generated edges in an equivalent graph. Modularity maximization for graph partitioning stems from the idea that each partition within the "best" graph partition has more edges than would appear in the configuration model. This approach has been applied to a variety of research fields, including bioinformatics, brain science, and social science (Mill et al., 2008; Xia et al., 2013).
We use modularity maximization in the CNN pooling layer via the following process: (i) We first create a graph that represents the seismic array by defining the weighted undirected graph with nodes that correspond to the stations and an adjacency matrix

A_{i,j} := exp(−d(i, j)/σ), if d(i, j) ≤ L, and A_{i,j} := 0 otherwise,    (2)

where d(i, j) is the distance between stations i and j, σ is a sensitivity parameter, and L is a cut-off distance parameter. We note that the determination of d(i, j) is important. In this study, we use the usual Euclidean distance between stations because this distance is generally robust for earthquake patterns that have various epicenters and ray-paths. The role of the sensitivity parameter σ is to control the weights between stations in the adjacency matrix, and the role of the cut-off distance L is to sparsify the adjacency matrix. (ii) We then apply modularity maximization to the weighted undirected graph to obtain the graph partition. Different values of the hyperparameters (σ, L) yield different partitions: applying modularity maximization to a station graph with a larger value of L and a smaller value of σ yields a station partitioning that has more stations within the same partition. (iii) We finally convolve the waveforms with the learnable filters in the CNN at each station and stack the waveforms at the stations belonging to the same partition. This step essentially conducts average pooling with respect to the station axis by using the partitions.
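Steps (i) and (ii) can be sketched as follows. This is a minimal illustration with hypothetical station coordinates (in km) and a naive agglomerative modularity maximizer, not the optimized algorithms used in the community-detection literature:

```python
import numpy as np

def station_graph(coords, sigma, L):
    """Step (i): weighted adjacency A_ij = exp(-d(i,j)/sigma) if d(i,j) <= L, else 0."""
    n = len(coords)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                d = np.linalg.norm(np.subtract(coords[i], coords[j]))
                A[i, j] = np.exp(-d / sigma) if d <= L else 0.0
    return A

def modularity(A, labels):
    """Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j) (weighted)."""
    k = A.sum(axis=1)
    two_m = A.sum()
    same = labels[:, None] == labels[None, :]
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

def greedy_partition(A):
    """Step (ii): merge the community pair with the largest modularity gain
    until no merge improves Q (naive agglomerative maximization)."""
    labels = np.arange(len(A))
    while True:
        base = modularity(A, labels)
        best_gain, best = 0.0, None
        ids = sorted(set(labels.tolist()))
        for a in ids:
            for b in ids:
                if a < b:
                    trial = np.where(labels == b, a, labels)
                    gain = modularity(A, trial) - base
                    if gain > best_gain:
                        best_gain, best = gain, (a, b)
        if best is None:
            return labels
        labels = np.where(labels == best[1], best[0], labels)
```

For two spatially separated clusters of stations, the partition recovers the clusters because the cut-off L removes the cross-cluster edges.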
The core idea of this operation is that the waveforms convolved with a learned filter at closer stations become similar to each other. Therefore, stacking (average pooling) along with graph partitioning can further improve the quality of the learned features. The choices of d, σ, and L may be important, but they can be made in a data-driven way, as discussed in the following subsection.

Resultant Architecture
We propose the following architecture, which is a combination of the CNN and graph partitioning. We first convolve the input waveform at each station with the filters, followed by max pooling with respect to the time axis at each station. We then conduct average pooling of the waveforms along the station axis using the graph partition (as described in the previous subsection). We finally apply the fully connected neural network to the stacked waveforms to obtain the output probability.
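The station-axis pooling step can be sketched as follows (a minimal NumPy sketch; `feats` stands for the per-station features after convolution and temporal max pooling, and the partition labels are assumed to come from the graph-partitioning step):

```python
import numpy as np

def partition_average_pool(feats, labels):
    """Average the per-station feature maps within each graph partition.

    feats:  array of shape (n_stations, n_filters, n_times)
    labels: partition id per station, length n_stations
    returns array of shape (n_partitions, n_filters, n_times)
    """
    labels = np.asarray(labels)
    parts = sorted(set(labels.tolist()))
    return np.stack([feats[labels == p].mean(axis=0) for p in parts])
```

The stacked output then feeds the fully connected layers, so the number of effective inputs scales with the number of partitions rather than the number of stations.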
In this study, we specified the hyperparameters of the CNN as follows. The number of learnable filters in the convolution layer was 30, with each filter being 1 s in length. The activation function was a rectified linear unit (Fukushima, 1980), max(x, 0). The length of max pooling was 0.25 s. The number of units in the fully connected layer was 40. The batch size during the training step was 20. The filters in the CNN were optimized during the training step by minimizing the cross-entropy loss using stochastic gradient descent optimization (Robbins & Monro, 1951) with momentum (Qian, 1999), where the learning rate was 0.005 and the momentum was 0.9. The cross-entropy loss is a function that maps a pair of a true label y (y = 1 for earthquake and y = 0 for noise) and an output probability p for an earthquake into −y log p − (1 − y) log(1 − p). Stochastic gradient descent optimization with momentum uses a linear combination of the gradient multiplied by the learning rate and the previous update multiplied by the momentum as the next update. The ℓ2 penalty with a regularization parameter of 1.0 (Ng, 2004) was added to the loss, which reduces potential overfitting.
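The loss and the update rule described above can be sketched generically in NumPy (a minimal sketch of the formulas, not the full training loop):

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """-y*log(p) - (1-y)*log(1-p), with p clipped for numerical safety."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

class SGDMomentum:
    """Next update = momentum * previous update - learning_rate * gradient."""
    def __init__(self, lr=0.005, momentum=0.9):
        self.lr, self.mu, self.v = lr, momentum, 0.0

    def step(self, w, grad):
        self.v = self.mu * self.v - self.lr * grad
        return w + self.v
```

Applied to a simple quadratic objective, the momentum update converges to the minimizer, which is the behavior relied on during training.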
The hyperparameters σ and L in Equation 2 were determined via hold-out validation. We set the candidates for σ and L to {0.1, 0.5, 1.0, 5.0} and {3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, respectively, and obtained σ = 5.0 and L = 8. These candidate sets, in conjunction with modularity maximization, generate a sufficient variety of station partitions (Figure S1). We randomly split the training data set into a training part (80% of the data set) and a test part (the remaining 20%). The training part was used for the training step, and σ and L were chosen to minimize the cross-entropy loss on the test part.
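The hold-out selection of (σ, L) amounts to a grid search; it can be sketched as follows, where `fit_and_score` is a hypothetical callable that trains on the training part for the given (σ, L) and returns the cross-entropy loss on the held-out part:

```python
import itertools

def select_hyperparameters(fit_and_score, sigmas, Ls):
    """Return the (sigma, L) pair minimizing the held-out loss."""
    best_loss, best_pair = float("inf"), None
    for sigma, L in itertools.product(sigmas, Ls):
        loss = fit_and_score(sigma, L)
        if loss < best_loss:
            best_loss, best_pair = loss, (sigma, L)
    return best_pair
```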

Data
We applied the proposed method to seismic records from a dense regional seismograph network, MeSO-net, deployed in the Tokyo metropolitan area, Japan, to demonstrate its performance. MeSO-net consists of 296 accelerometers with an average station interval of approximately 5 km. The station intervals in the central part of the network are in the 2-3 km range. The observed waveforms contain various types of noise signals even though the accelerometers were installed approximately 20 m below the surface. The main source of noise in MeSO-net is human activity, which generates environmental noise from sources such as trains, automobiles, aircraft, and factories. An exhaustive study of the noise observed at MeSO-net stations (Kawakita & Sakai, 2009) reports representative noise waveforms with their spectrograms. These noise signals appear randomly in time at individual stations, whereas the seismic signals do not. Therefore, analysis of the information commonly recorded in the dense station network potentially improves the capability of earthquake detection.
Here, we selected 13 stations located in the eastern part of MeSO-net as the target seismic array (Figure 1) because they are distributed approximately in the center of the seismically active area in and around the Kanto district (Figure 3). Therefore, this seismic array provided the largest number of earthquakes determined by visual picking that are available for supervised learning by the CNNs. Three-component continuous data were recorded at a 200 Hz sampling rate at each seismic station in the array. We downsampled the records to 25 Hz and then applied a 2-8 Hz bandpass filter to remove background noise. Background noise levels in this frequency range (2-8 Hz) are relatively small (Nishida et al., 2008), and earthquake signals are clearly identified at MeSO-net stations. We detrended each input waveform and normalized it using the median of the maximum amplitudes of the individual station waveforms.
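The preprocessing chain (downsample to 25 Hz, detrend, 2-8 Hz bandpass, normalize by the median of the per-station maxima) can be sketched with SciPy; the filter order, the use of zero-phase filtering, and the exact ordering of the detrend step are our assumptions, as the text does not specify them:

```python
import numpy as np
from scipy.signal import butter, decimate, detrend, sosfiltfilt

def preprocess(waveforms, fs=200, fs_out=25, band=(2.0, 8.0)):
    """waveforms: (n_stations, n_samples) at fs Hz; returns filtered,
    normalized traces at fs_out Hz."""
    # Downsample 200 Hz -> 25 Hz (factor 8), with anti-alias filtering.
    x = np.stack([decimate(w, fs // fs_out, zero_phase=True) for w in waveforms])
    x = detrend(x, axis=-1)
    # Zero-phase 2-8 Hz bandpass (assumed 4th-order Butterworth).
    sos = butter(4, band, btype="bandpass", fs=fs_out, output="sos")
    x = sosfiltfilt(sos, x, axis=-1)
    # Normalize by the median of per-station maximum absolute amplitudes.
    scale = np.median(np.abs(x).max(axis=-1))
    return x / scale
```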
The Earthquake Research Institute, The University of Tokyo, identified 599 earthquakes from MeSO-net data that spanned the September 4-16, 2011 time period (Figure 3a). A total of 6,241 noise signals were also obtained as a by-product of applying the STA/LTA method to the continuous waveform data. We manually eliminated all of the signals with very small SNRs at the target array after checking all of the potential earthquakes recorded by the array. This elimination is necessary to mitigate the potential deterioration of the performance of machine learning methods due to mislabeling (Frénay & Verleysen, 2014), because low-SNR events can be buried in seismic noise and thus be mislabeled as noise. Finally, we used 527 earthquakes.
We selected the waveforms from 209 available earthquakes that were detected during September 4-8, 2011 as the training earthquake data set. We also used the waveforms from 927 M_L > 2.8 earthquakes that occurred during the 2011-2016 period (excluding September 4-16, 2011) for the training earthquake data set. We extracted 20-s-long time windows that contained the onsets of both the P- and S-wave arrivals for all of the available events for which at least three stations recorded detectable P- and S-wave onsets. To detect earthquakes in the Tokyo metropolitan area (the epicentral distance is about 160 km from the target array of 13 stations), we set the window length to 20 s so that it contains both the P- and S-wave arrivals of each event. Figure 3b shows the distribution of epicenters of these training earthquakes. We employed the waveforms from 318 earthquakes that were detected during September 9-16, 2011 as the validation earthquake data set. We used the noise signals identified by the STA/LTA technique, with each waveform starting within 5 s of the onset of each noise signal, as noise. We tried several variations of this start time (1, 2, 3, and 4 s), but found no significant change in the performance of our method. A total of 6,241 noise signals with 20-s-long time windows from September 4-16, 2011 were used: those from September 4-8 for training, and those from September 9-16 for validation.
We employed a data set augmentation technique owing to the relatively small size of the original training data set and the imbalance between the earthquake and noise data sets. Specifically, we injected zero-mean Gaussian white noise with randomly determined scales into the training windows. The distribution of the scales was an exponential distribution with a mean of 0.001. This data set augmentation technique alleviates potential overfitting and data imbalance (Sietsma & Dow, 1991), and has been used in applying neural networks to seismological studies (Mousavi et al., 2020; Perol et al., 2018).
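The augmentation step can be sketched as follows (a minimal NumPy sketch; the random-number generator and its seeding are implementation details not specified in the text):

```python
import numpy as np

def augment(window, rng, mean_scale=0.001):
    """Inject zero-mean Gaussian white noise whose standard deviation is
    drawn from an exponential distribution with the given mean."""
    scale = rng.exponential(mean_scale)
    return window + rng.normal(0.0, scale, size=window.shape)
```

Each call draws a fresh noise scale, so repeated augmentation of the same window yields a family of slightly perturbed training examples.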
We briefly mention the results of applying modularity maximization to the target seismic array when L and σ in Equation 2 are fixed to the values specified via hold-out validation. The resultant partition is illustrated in Figure 2. Modularity maximization using the Euclidean distance achieves a reasonably equispaced station distribution within each partition. Furthermore, the histograms of the S-P arrival-time differences are quite similar to each other within the same partition, as shown in Figure S2, indicating that graph partitioning with the optimized hyperparameters works well.

Performance of Our Method
We begin by reporting the performance of our method while varying the minimum threshold value for the output earthquake probability. Hereafter, we refer to this threshold as the "minimum detection probability (MDP)." Both the number of true positives (classifying a true earthquake as an earthquake) and the number of false positives (classifying true noise as an earthquake) become smaller as the MDP increases.
As typical cases, we present results calculated for three MDPs (0.5, 0.61, and 0.9) in Table 1. Our method detects almost all of the earthquakes when the MDP is set to 0.5, but it yields a relatively large number of false positives. The number of false positives is significantly reduced when the MDP is set to 0.9, but fewer earthquakes are detected in the process. An MDP of 0.61 results in >97% true positives among the true earthquakes and <4% false positives among the true noise signals. We chose 0.61 as the optimal MDP for our method from the receiver operating characteristic (ROC) curve in Figure 4. The ROC curve measures the classifier performance as the tradeoff between the true positive and false positive rates over the range of MDPs. The point for an MDP of 0.61 is the closest to the upper-left corner, which indicates that the proposed method yields nearly the ideal/best performance at this MDP. The ROC curve of our method also shows that the false positives are <1% when the MDP is large (>0.8), <10% when the MDP is >0.5, and increase markedly when the MDP is <∼0.35.

Table 1. Resultant True Positives and False Positives of Our Method for Three Different MDPs. Notes. The true positives are events that have been classified as earthquakes from true earthquakes, and the false positives are events that have been classified as earthquakes from true noise. We used 318 true earthquakes and randomly selected 500 true noise signals. Abbreviation: MDP, minimum detection probability.
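The ROC construction and the choice of the optimal MDP (the threshold whose point is closest to the upper-left corner) can be sketched as follows, using synthetic scores rather than the study's data:

```python
import numpy as np

def roc_curve(probs, labels, thresholds):
    """(FPR, TPR) pairs for each candidate MDP."""
    pts = []
    for t in thresholds:
        pred = probs >= t
        tpr = np.mean(pred[labels == 1])  # true positive rate
        fpr = np.mean(pred[labels == 0])  # false positive rate
        pts.append((fpr, tpr))
    return np.array(pts)

def optimal_mdp(probs, labels, thresholds):
    """MDP whose (FPR, TPR) point is closest to the ideal corner (0, 1)."""
    pts = roc_curve(probs, labels, thresholds)
    dist = np.hypot(pts[:, 0], 1.0 - pts[:, 1])
    return thresholds[int(np.argmin(dist))]
```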
We also compared the ROC curve of our method with those of two different CNNs, namely, a CNN for a single station (CNN@station name) and the conventional CNN for multiple seismic stations (conventional CNN), as shown in Figure 4. The conventional CNN is a CNN without any pooling operation along the station axis, which is commonly employed in multi-station set-ups. We note that our method reduces to the conventional CNN as a specific case when L is less than the minimum distance between stations, and to the CNN for a single station when applied to a single station. We adopt E.IIDM as the single station because it has the highest SNR among the stations in the seismic array (Figure S3).
The ROC curves of the three methods highlight that our method outperforms the other methods (Figure 4). Our method has the lowest false positive rate among the three methods for a given true positive rate. Furthermore, a comparison of the three methods at their optimal MDPs (the points closest to the upper-left corners of the ROC curves) indicates that our method attains both the most true positives and the fewest false positives. Figure S4 provides a more detailed comparison with different CNNs.

Temporal Variation in Output Probability When Detecting an Event
We checked the temporal variation of the performance of our method. We calculated the output probability by applying our method to 2-8 Hz bandpass-filtered waveform records, obtaining a time series of output probabilities by shifting the 20-s-long window in 0.02 s increments. Figure 5 shows the temporal variation of the output probability, which demonstrates that the output of our method is sensitive enough to detect seismic signals: The output probability suddenly increases and exceeds 0.61 when the P-waves have arrived within the time window at several stations (1), exceeds 0.95 when the P-wave arrivals are within the time window at almost all of the stations (2), reaches 0.98 when both the P- and S-waves appear in the window at almost all of the stations (3), and gradually decreases to <0.61 as the S-wave arrivals move outside of the time window at several stations (4).

Effects of the Signal-to-Noise Ratio on Event Detection
We introduced an SNR-based metric, the median of the ratios of the in-event median absolute deviations (MADs) to the pre-event MADs, to quantify the impact of the SNR on the performance of the method. Specifically, we defined the median MAD ratio (MADR) as

MADR = median{ MAD_in,i / MAD_pre,i : i = 1, …, N },
MAD_pre,i = median{ |a_pre,i(t) − ā_pre,i| : t = 1, …, T },
MAD_in,i = median{ |a_in,i(t) − ā_in,i| : t = 1, …, T },

where T is the time length, N is the number of stations, {a_pre,i(t): t = 1, …, T} is the pre-event recorded amplitude at station i, {a_in,i(t): t = 1, …, T} is the in-event recorded amplitude at station i, ā_pre,i is the median of a_pre,i(t) with respect to time t, and ā_in,i is the median of a_in,i(t) with respect to time t. We used the 20-s-long waveforms that started 30 s prior to the P-wave arrivals as the pre-event waveforms. We defined the three-component median MADR as the median of the three median MADRs calculated from the Up-Down, North-South, and East-West components. The median MADR is a median version of the root-mean-square (RMS) deviation ratio and is more robust than the RMS deviation ratio when the waveforms are contaminated (see Chapter 5 of Huber & Ronchetti, 2009). Figure 6a shows that the lower bound of the output probabilities of events with the same median MADR increases with increasing median MADR. Our method generally misses earthquakes with a small median MADR (i.e., a small SNR). Similar behavior is observed for the conventional CNN (Figure 6b). Yet, we find that our method generally yields higher output probabilities than the conventional CNN. In addition, the performance of our method deteriorates more slowly than that of the conventional CNN, which is confirmed by comparing the largest values of the median MADRs whose output probabilities fall below the optimal MDPs. The combination of a higher output probability (Figure 6) and a lower false detection rate (Figure 4) indicates that the appropriate clustering of stations in a seismic array greatly improves CNN-based earthquake detection.
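Under the definitions above, the median MADR and its three-component version can be computed as follows (a sketch assuming arrays of shape (stations, samples) per component):

```python
import numpy as np

def mad(x):
    """Median absolute deviation from the median, along the time axis."""
    med = np.median(x, axis=-1, keepdims=True)
    return np.median(np.abs(x - med), axis=-1)

def median_madr(pre, inn):
    """Median over stations of in-event MAD / pre-event MAD."""
    return np.median(mad(inn) / mad(pre))

def three_component_median_madr(pre_components, in_components):
    """Median of the per-component median MADRs (e.g., UD, NS, EW)."""
    return np.median([median_madr(p, i)
                      for p, i in zip(pre_components, in_components)])
```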
It is worth noting that events with median MADR values of <1.0 can in some cases be detected with relatively high output probabilities. This means that the changes between the pre- and in-event amplitudes are not significant for such events. Further analysis of the waveform and/or amplitude features of these events may improve the detection accuracy of earthquakes with small SNRs.

Robustness With Respect to Missing Data
Continuous seismic records may be incomplete owing to temporary power outages, as well as failures of seismic sensors and/or data loggers. Therefore, we examined the robustness of our proposed method with respect to such incomplete data records to investigate its performance and stability. We created subsampled data that consisted of waveforms with missing data using the validation data set (318 earthquakes). We randomly removed seismic stations and created 10 subsampled data sets with different numbers of missing stations for the earthquakes in the validation data set. Figure 7a illustrates how the number of missing stations changes the output probability for 3,180 trials, with performance dropping as the number of missing stations increases. Increasing the number of missing stations from one to six results in 83%, 80%, 78%, 74%, 68%, and 58% of the trials possessing output probabilities of >0.61, respectively. The figure also shows that the median output probability remains above 80%, regardless of the number of missing stations, which implies that the performance deterioration is relatively low for half of the trials. These two observations suggest that data quality, in addition to missing data, affects the performance deterioration. We selected three events with different median MADRs to illustrate this point (Figure 7b). The event with the highest median MADR (=95.30) possessed an output probability of approximately 1.0, even as the data loss arising from missing stations increased. The event with a median MADR of nearly 1.0 possessed an output probability that decayed as a function of data loss but remained high (>0.9) until the number of missing stations increased to six. The event with a low median MADR (=0.77) possessed an output probability that decayed substantially, falling below 0.61 when the number of missing stations increased to five. Given high-quality data, our CNN-based method can reliably evaluate earthquakes even if up to half of the seismic data are missing.
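The subsampling experiment can be sketched as follows. We assume, purely as an illustration, that a missing station is represented by a zero-filled waveform so that the fixed input shape of the network is preserved; the text does not specify the exact fill value:

```python
import numpy as np

def drop_stations(x, n_missing, rng):
    """Zero out the waveforms of randomly chosen missing stations.

    x: array of shape (n_stations, n_samples); returns (subsampled copy,
    indices of the removed stations)."""
    missing = rng.choice(len(x), size=n_missing, replace=False)
    out = x.copy()
    out[missing] = 0.0
    return out, missing
```

Repeating this for each validation event and each count of missing stations reproduces the structure of the 3,180-trial experiment.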

Continuous Application to the September 9, 2011 Data
We applied our method to 1-h-long continuous waveform data, starting at 0:00 JST on September 9, 2011, to explore the feasibility of the proposed method for detecting earthquakes from continuous streaming data. We used these data because a large number of earthquakes were excited by the March 11, 2011 Tohoku-oki earthquake (M9), even in the Kanto district (e.g., Ogata et al., 2019). We calculated the temporal change in the output probabilities by shifting the 20-s-long window in 0.02 s increments over the entire hour. We picked event candidates that possessed output probabilities of >0.61 for more than 1 s (yellow and orange stars in Figure 8b).
A comparison of the time series of the median MAD at the stations within the array and the output probabilities from our method (Figure 8) reveals several sharp peaks in the output probability sequence that are synchronous with local peaks in the median MAD sequence, with four event candidates being detected with a high degree of confidence. Two of these event candidates correspond to earthquakes that had been detected by JMA and MeSO-net (orange stars in Figure 8), thereby supporting the validity of our method in detecting earthquakes using the seismic array. Although the other two candidates (yellow stars in Figure 8b) were not detected by the routine MeSO-net detection procedure, we consider them to be actual earthquakes, as seismic signals can be identified in the nationwide seismic network in the Kanto district (e.g., the high-sensitivity seismograph network in Japan (Hi-net); Obara et al., 2005; Okada et al., 2004) at the times of these two events. The relocated hypocenters of these events, which were determined from visually picked arrival times, fall within the lateral range of the training earthquakes (Figure 8e). The event at approximately 00:50:20 JST has a very high (>0.99) output probability (Figure 8d). This high output probability and the proximity of the event to a cluster of training earthquakes imply that events occurring near training earthquakes can be detected with a high output probability. MeSO-net provides a larger, denser distribution of seismic stations around the Tokyo metropolitan area than the nationwide seismic network. Therefore, the level of performance obtained in this study suggests that the implementation of a suitable method can improve the ability to detect and monitor small earthquakes that occur in and around metropolitan areas.

Application of Our Method to Seismic Array Data Not Used for the Training
To demonstrate the performance of our method on untrained seismic arrays, we checked the concordance of the output probabilities with the earthquake signal arrivals for several seismic arrays that were not used for training.
We used four untrained arrays, each consisting of 13 seismic stations, as shown in Figure 9a. We chose these four arrays because they lie at different distances from the trained array. We note that our detection method can be applied to an untrained array when two conditions hold: the array contains at most 13 stations, and the array can be divided into three partitions. In this experiment, we selected the hyperparameter in Equation 1 for the untrained arrays as shown in Figure S5. We used a 100-s-long waveform window starting at 23:29:56 JST on September 9, 2011, as depicted in Figure 9b. This window contains two successive earthquakes, with magnitudes Mj 3.9 and Mj 2.9, whose origin times of 23:29:52 JST and 23:30:20 JST differ by ∼30 s. The source depths are 4.4 km (0.9) and 5.2 km (0.5), respectively. We calculated the temporal change in the output probabilities by shifting the 20-s-long window in 0.02-s increments.
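The requirement that an array be divisible into three partitions can be illustrated with a simple spectral partitioning sketch. This is a generic stand-in under our own assumptions (Gaussian-affinity graph, Fiedler-vector ordering, balanced cuts), not the graph-partitioning algorithm used in the paper.

```python
import numpy as np

def spectral_partition(coords, n_parts=3, sigma=1.0):
    """Balanced spectral partitioning sketch for a small station array.

    coords: (n_stations, 2) station positions. Build a Gaussian-affinity
    graph, compute the Fiedler vector (eigenvector of the second-smallest
    eigenvalue of the graph Laplacian), sort stations along it, and cut
    the ordering into n_parts nearly equal groups.
    """
    n = len(coords)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))     # pairwise affinities
    np.fill_diagonal(w, 0.0)
    lap = np.diag(w.sum(axis=1)) - w         # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(lap)            # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    order = np.argsort(fiedler)
    labels = np.empty(n, dtype=int)
    for k, chunk in enumerate(np.array_split(order, n_parts)):
        labels[chunk] = k
    return labels
```

For well-separated station clusters, sorting along the Fiedler vector groups nearby stations together, so each partition contains geographically coherent stations.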
We first confirm that in the trained array, the two events have high SNRs and the output probabilities are concordant with their arrivals. In Untrained Arrays 1, 2, and 3, our method correctly detects both events with high output probabilities (>0.61). In Untrained Array 4, our method successfully detects the first event but misses the second; this missed detection is due to a low SNR (Figure 9f). In summary, our method is applicable to untrained networks under the two conditions described above.

Response to Teleseismic Earthquakes
Here, we scrutinized the response of our method to waveforms generated by distant earthquakes with moderate-to-large magnitudes. As an example, Figure S6 shows the time series of the output probability together with the 2-8 Hz bandpass-filtered waveform records from the 2011 M w 7 Vanuatu earthquake (epicentral distance of about 6,400 km). The output probability fluctuates around the optimal MDP line (0.61) when the direct P waves arrive at the array. Although our detection method reacts when the seismograms have sufficient amplitudes in the 2-8 Hz band, the output probability does not reach values far exceeding the optimal MDP line (0.61). We obtained similar temporal changes in the output probabilities for the other distant events. Thus, our method yields only a weak detection at the arrival of waveforms from a distant event.
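The 2-8 Hz band-pass filtering referred to above can be sketched with a simple zero-phase FFT-domain filter. Real seismic pipelines typically use a Butterworth filter (e.g., via ObsPy or SciPy); the brick-wall version below is only a compact, dependency-light stand-in.

```python
import numpy as np

def bandpass(x, fs, fmin=2.0, fmax=8.0):
    """Zero-phase FFT-domain band-pass filter.

    x: 1-D waveform samples; fs: sampling rate (Hz).
    Zeros all Fourier coefficients outside [fmin, fmax] and inverts
    the transform, keeping only energy in the passband.
    """
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spec[(freqs < fmin) | (freqs > fmax)] = 0.0
    return np.fft.irfft(spec, n=len(x))
```

Applied to a synthetic record containing 0.5, 5, and 20 Hz sinusoids, only the 5 Hz component inside the 2-8 Hz passband survives.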
To improve the performance of our method, one approach is to train the CNN to recognize teleseismic earthquakes as a category of noise signals. It would then be necessary to develop an appropriate classification method for noise signals before applying supervised learning. Such an approach could be implemented in combination with unsupervised learning (Johnson et al., 2020).

Conclusion
We developed a CNN that incorporates graph partitioning for earthquake detection using a multi-station local seismic array. The performance of our method is validated by both its high true detection rate and low false detection rate, and it outperforms two conventional types of CNN-based detectors. We found that the temporal variations in the output earthquake probability of our method during event detection are concordant with the P- and S-wave arrivals in the array. We used the median MADR as a measure of the SNR to confirm that the SNR has some influence on the output earthquake probability of our method; however, some low-SNR events can still be detected with relatively high output probabilities. Numerical experiments using subsampled data demonstrated that our method remains robust when some data are missing. By applying our method to 1-h-long continuous waveform data, we also identified new seismic events with extremely low SNRs that are not listed in existing earthquake catalogs. Finally, we applied our method to untrained arrays to demonstrate its potential portability.