SeismoGen: Seismic Waveform Synthesis Using GAN With Application to Seismic Data Augmentation
Abstract
Detecting earthquake arrivals within seismic time series can be a challenging task. Visual, human detection has long been considered the gold standard but requires intensive manual labor that scales poorly to large data sets. In recent years, automatic detection methods based on machine learning have been developed to improve the accuracy and efficiency. However, the accuracy of those methods relies on access to a sufficient amount of high-quality labeled training data, often tens of thousands of records or more. We aim to resolve this dilemma by answering two questions: (1) provided with a limited amount of reliable labeled data, can we use them to generate additional, realistic synthetic waveform data? and (2) can we use those synthetic data to further enrich the training set through data augmentation, thereby enhancing detection algorithms? To address these questions, we use a generative adversarial network (GAN), a type of machine learning model which has shown supreme capability in generating high-quality synthetic samples in multiple domains. Once trained, our GAN model is capable of producing realistic seismic waveforms of multiple labels (noise and event classes). Applied to real Earth seismic data sets in Oklahoma, we show that data augmentation from our GAN-generated synthetic waveforms can be used to improve earthquake detection algorithms in instances when only small amounts of labeled training data are available.
Key Points
- We develop a generative adversarial neural network model that is capable of synthesizing three-component waveforms of multiple labels
- We validate the synthetic waveforms both visually and quantitatively through use of a machine-learning based earthquake classifier
- We demonstrate that our synthetic waveforms can augment real seismic data to improve machine learning-based earthquake detection methods
1 Introduction
The detection of earthquake events within seismic records plays a fundamental role in seismology. However, this task can be challenging in practice. Seismic waveforms have unique characteristics compared to time series from other physics domains, and intensive training and domain knowledge are required to recognize and characterize them accurately. Automated seismic detection methods have been deployed for decades, with the most popular methods including short-time-average/long-time-average (Allen, 1978) and waveform correlation approaches (Gibbons & Ringdal, 2006). However, these computational detection methods may sometimes generate too many false positives, can fail in situations with low signal-to-noise ratio, and often suffer from expensive runtime costs (Yoon et al., 2015).
Despite these limitations, the need for more advanced, automatic, and efficient earthquake detection algorithms is becoming more urgent due to the rapid increase in the volume of available seismic data to process. In recent years, machine learning methods using deep neural network (DNN) architectures have been successfully deployed in the computer science community in object detection tasks and to identify patterns within image data sets (He et al., 2016; Huang et al., 2017; Krizhevsky et al., 2012a; Simonyan & Zisserman, 2014; Szegedy et al., 2017). These successes have motivated seismologists to pursue the development of DNN-based earthquake detection methods. Perol et al. (2018) introduced a convolutional neural network (CNN) architecture (“ConvNetQuake”) to study induced seismicity in Oklahoma. Ross et al. (2018) leveraged the vast labeled data sets of the Southern California Seismic Network archive to develop the CNN-based Generalized Phase Detection algorithm, while Zhu and Beroza (2019) developed a similar approach called PhaseNet using data sets from northern California. Taking advantage of the temporal structure of seismic waveforms, Mousavi et al. (2019) used a hybrid convolutional and recurrent neural network architecture in devising the CRED algorithm. An “image-to-image” idea and a fully convolutional network have been utilized to characterize induced earthquakes in Oklahoma (X. Zhang et al., 2020). Mousavi et al. (2020) developed a global deep learning model to perform multi-task learning for both earthquake detection and phase picking. Several other studies have built on and modified these approaches, applying them to various problems in seismology (Dokht et al., 2019; Kriegerowski et al., 2019; Linville et al., 2019; Lomax et al., 2019; Meier et al., 2019; Tibi et al., 2019). Recent reviews can be found in Bergen et al. (2019) and Kong et al. (2019). In this article, we apply the “DeepDetect” detection method (Wu et al., 2019), which is a cascaded region-based CNN designed to capture earthquake events of different sizes while incorporating contextual information to enrich the features for each proposal, and the work of Z. Zhang et al. (2019), which implemented a deep learning-based earthquake/non-earthquake classification model with an adaptive threshold frequency filtering module to achieve superior performance.
Two important challenges for predictive, supervised learning models like those used for earthquake detection tasks are their generalization ability and robustness in out-of-sample applications. These topics are a critical consideration not only for seismology problems but also for a broad suite of deep learning applications (Kawaguchi et al., 2020). A lack of generalization ability can lead to a situation where the predictive model achieves satisfactory performance on the training set but degrades significantly when applied to an unseen (out-of-sample) data set. The high capacity of DNNs is the root cause of weak generalization, which can lead the model to “memorize” the training set and result in overfitting. Various techniques have been proposed and developed in the machine learning community to alleviate overfitting and enhance generalization ability. These include advanced regularization techniques such as dropout, batch normalization, and early stopping (Kukacka et al., 2018), transfer learning approaches (Tan et al., 2018a), and data augmentation (Shorten & Khoshgoftaar, 2019).
In the seismology community, there have been several recent works studying the importance of model generalization (Chai et al., 2020; Liu et al., 2020; Mousavi et al., 2020; Park et al., 2020; K. Wang, Ellsworth, & Beroza, 2020; R. Wang, Schmandt, et al., 2020). As demonstrated in most of these recent studies, DNN-based models exhibit some generalization ability that allows a model trained on one data set to be applied to a different study region. To improve generalization ability, Chai et al. (2020) recently developed a transfer learning technique for the PhaseNet model (Zhu & Beroza, 2019) that is capable of boosting the performance of PhaseNet in new contexts by retraining the network.
Here, we build on this theme of generalization by developing an advanced data augmentation technique to synthesize realistic, labeled seismic waveform data that can be used as a complement to real Earth training data. Data augmentation techniques have been widely explored and applied in computer vision tasks (Shorten & Khoshgoftaar, 2019), and can be broadly categorized as basic manipulations, such as geometric transformation, noise injection, or mixing data (Krizhevsky et al., 2012b), and generative-model-based approaches (Antoniou et al., 2018; Frid-Adar et al., 2018). Regardless of the details of the technique, all data augmentation methods aim to expand the available training data by adding synthesized data in the hope of improving model performance. However, there is always a risk that data augmentation may destroy the semantic content of the original data. This can occur, for example, when too many unrealistic synthetic data are added to the original data set. Compared to data augmentation using basic manipulations, generative models learn the intrinsic underlying distribution of real data samples. Because of this, they are capable of producing realistic samples when properly trained.
Our methodology is based on a type of generative model called a generative adversarial network (GAN), which instantiates an adversarial min-max game between two networks called the generator and the discriminator (Goodfellow et al., 2014). The role of the generator is to synthesize realistic data by sampling from a simple distribution, such as a Gaussian, and learning to map those samples to the data domain using a neural network as a universal function approximator. The discriminator, in contrast, is trained to distinguish such synthetic data from real data samples. Training proceeds adversarially between these two networks. Novel methods have been successfully developed to apply GANs to image synthesis (Creswell et al., 2018; Goodfellow et al., 2014), audio waveform generation (Chen et al., 2017; Engel et al., 2019; Yang et al., 2017), and speech synthesis (Kaneko et al., 2017; Pascual et al., 2017; Saito et al., 2017). There are multiple variants of GAN, the most important for this article being the conditional GAN (Mirza & Osindero, 2014), which turns the traditional GAN into a conditional model, allowing the user to customize the category of the generated samples with additional class label information.
In earthquake detection, there is surprisingly little work studying the effectiveness of data augmentation from generative models like GANs (Zhu et al., 2020). To the best of our knowledge, a generative-model-based data augmentation strategy has not yet been demonstrated for earthquake detection. However, GAN models have been applied to more general problems in seismology. Z. Li et al. (2018) employ a GAN as a feature extractor to learn key signatures of waveform data, and then use a random forest classifier to distinguish events of interest from noise. They build their generative model using vertical-component P-wave data and a large number of positive and negative samples (a total of 650,000). Building on these results, Meier et al. (2019) analyze and compare the performance of different neural network-based classifiers, including the one developed by Z. Li et al. (2018). Neither of these studies explores data augmentation techniques for seismic event detection, as their focus primarily relates to feature extraction using GANs. Generative models have also proven effective in other geophysical applications such as inversion (Z. Zhang & Lin, 2020; Zhong et al., 2020), data processing (Picetti et al., 2019), interpretation (Lu et al., 2018), and many others.
The generative model developed here, which we call “SeismoGen,” is a conditional GAN that produces synthetic three-component seismic waveform time series including both P- and S-waves. Due to the conditional GAN structure, SeismoGen is capable of synthesizing data with multiple labels, in this case, arrivals from earthquakes and background noise. To evaluate the performance of SeismoGen, we apply and test it using waveform data acquired from three seismic stations in Oklahoma. We validate the quality of synthetic seismic events visually and quantitatively, and explore the feasibility of augmenting limited data sets with synthetic samples for earthquake detection problems.
The layout of this article is as follows. To begin, in Section 2, we provide details on the field data and preprocessing techniques. Next, in Section 3, we describe the fundamentals of GAN models and their variants, and then develop and discuss our SeismoGen model. Then, in Section 4, we describe experimental results. Finally, in Sections 5 and 6, we discuss model limitations, future work, and present concluding remarks.
2 Data Description and Preparation
2.1 Raw Seismic Waveform Time Series
Broadband seismometers are highly sensitive instruments that are capable of recording small earthquakes. This sensitivity comes with a tradeoff, as they will also record background noise and other non-earthquake signals. Earthquake detection can be posed mathematically as a classification problem, where the objective is to partition the observed waveforms into different classes. In the simplest case, which we adopt here, there are two classes of interest: earthquake and non-earthquake (or noise).
The duration and characteristics of earthquake waveforms may vary significantly from event to event, depending on the source duration and mechanism, the source-receiver distance, and attenuation along the raypath in the shallow subsurface. However, all earthquake waveforms exhibit a universal set of features governed by underlying geophysical constraints. The physics of seismic wave propagation imposes temporal and polarization structure on earthquake waveforms. For example, P-waves arrive before S-waves and are typically of lower amplitude and more visible on vertical-component sensors. Any machine learning algorithm meant to synthesize realistic earthquake waveforms will need to account for these physical constraints in their model, either explicitly or implicitly.
2.2 Data Set Preparation
We use three field data sets to validate the performance of our model. Each data set is processed from raw waveform data acquired at three stations from the Transportable Array (network code TA): V34A, V35A, and V36A. All three stations are located in the state of Oklahoma, approximately 60–80 km away from Oklahoma City, as shown in Figure 1a. Station V34A operated at its Oklahoma site from October 31, 2009 through September 3, 2011. Station V35A operated at its Oklahoma site from March 13, 2010 through February 20, 2012. Station V36A operated at its Oklahoma site from June 10, 2010 through February 9, 2012. All three stations are three-component broadband seismometers (channel codes BHE, BHN, BHZ) operating at a sampling rate of 40 Hz.

Study region and data set overview. (a) Map of study region, including the TA network stations V34A, V35A, and V36A located in Oklahoma during our study period. (b) Joint scatter plot and histograms of event magnitude and source-to-station distance for the earthquake catalog data used to define training and testing data.
We use an earthquake catalog obtained from the Oklahoma Geological Survey (Oklahoma, 2011) to assemble training and testing data sets of earthquake arrivals. During the time of operation in our study area, we have arrivals from 1,025 earthquakes recorded at station V34A, 1,120 earthquakes recorded at station V35A, and 767 earthquakes recorded at station V36A. Overall, our data set (Figure 1b) concentrates on small magnitude earthquakes (ML < 3) recorded at local distances (R < 200 km). This is important to keep in mind, as the statistics of the input earthquake catalog will control the variability and performance of the generative model. An example recording of an earthquake arrival from this catalog is shown in Figure 2.

Illustration of an earthquake arrival obtained from station V34A. We include both the raw waveform (left column) and a high-pass filtered waveform (right column). The three rows in each figure show components BHE, BHN, and BHZ, respectively. P- and S-wave arrival times are denoted on the filtered waveforms with blue and red vertical lines.
In designing a machine learning-based detection algorithm, the maximum duration of an earthquake waveform is an important parameter to decide. We apply a consistent window size to all earthquakes here. We find a window size of 40 s (1,600 time steps) to be a good option: it is large enough to cover any individual earthquake while small enough to facilitate efficient training (Z. Zhang et al., 2019). Our algorithm is thus designed to operate on time series samples defined as three-component vectors of length 1,600. We label seismic samples as positive or negative based on whether or not an earthquake event is included in the time series. This parameterization is sufficient for our purposes, as earthquakes in our data sets are relatively sparse in occurrence over time. We find that the duration between any two neighboring earthquakes in our catalog is never less than 3,200 time steps, so that any two consecutive earthquakes will never be included in the same positive sample of length 1,600 time steps.
With all the aforementioned details on our raw seismic waveform, we build our data set guided by four rules:
- Each positive sample shall cover a single earthquake
- Negative samples shall not cover any earthquake
- Positive and negative samples shall not overlap with each other
- The number of positive and negative samples shall be balanced
To better utilize the detected events from our earthquake catalog, we apply a basic data augmentation technique to expand the size of the initial data set. We use station V34A as an example to demonstrate this procedure. For each seismic event located at a time stamp t, we first sample three offsets o1, o2, and o3 from a discrete uniform distribution Unif[−600, 600]. We then create three positive samples by segmenting three intervals of length 1,600 centered at t + o1, t + o2, and t + o3 on the raw waveform data. We repeat this procedure for each of the 1,025 events detected on V34A, providing us a total of 1,025 × 3 = 3,075 positive samples. We balance these positive samples by randomly selecting a total of 3,075 time segments of length 1,600 from the remainder of the raw seismic waveform. Together, the positive and negative samples result in a total data set size of 6,150 for station V34A. A similar procedure is applied to station V35A. For V36A, as the number of seismic events is limited to 767, we sample four offsets and create four positive samples from each event. Figures 3 and 4 compare waveforms from positive and negative samples at each station.
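This offset-window procedure can be sketched in a few lines of Python. The sketch below is a minimal illustration under our own assumptions (the continuous record is stored as a (3, N) NumPy array and event arrival times are given as sample indices); the function and variable names are ours, not code from the original study.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_positive_samples(waveform, event_indices, n_offsets=3,
                          window=1600, max_offset=600):
    """Cut `n_offsets` windows of `window` samples around each cataloged event.

    waveform: (3, N) array of three-component data.
    event_indices: sample index of each cataloged event.
    """
    samples = []
    for t in event_indices:
        # Offsets drawn from a discrete uniform distribution Unif[-600, 600].
        offsets = rng.integers(-max_offset, max_offset + 1, size=n_offsets)
        for o in offsets:
            start = t + o - window // 2  # window centered at t + o
            if 0 <= start and start + window <= waveform.shape[1]:
                samples.append(waveform[:, start:start + window])
    return np.stack(samples)  # e.g., 1,025 events x 3 offsets = 3,075 samples
```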

Illustration of positive waveform samples from each of the three stations V34A, V35A, and V36A. Each sample consists of a 40-s period of seismic waveform from a sampling rate of 40 Hz (1,600 points total, as seen on the x-axis). The left, center, and right columns show a three-component sample from each of the three stations. The top three rows show the raw waveforms of the positive samples and the bottom three rows show the corresponding filtered waveforms.

Illustration of negative waveform samples from each of the three stations V34A, V35A, and V36A. Each sample consists of a 40-s period of seismic waveform from a sampling rate of 40 Hz (1,600 points total, as seen on the x-axis). The left, center, and right columns show a three-component sample from each of the three stations. The top three rows show the raw waveforms of the negative samples and the bottom three rows show the corresponding filtered waveforms.
In raw seismic time series, the digitized values logged by the seismic stations are spread over a range of roughly ±10⁷ counts. To effectively learn the features of the seismic waveforms, the data set needs to be appropriately normalized. To achieve this, we normalize the waveforms on each channel by subtracting the mean and dividing by the standard deviation.
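A minimal sketch of this per-channel z-score normalization (the small eps guard against zero variance is our own addition):

```python
import numpy as np

def normalize(sample, eps=1e-8):
    """Per-channel z-score normalization of a (3, 1600) waveform sample."""
    mean = sample.mean(axis=1, keepdims=True)
    std = sample.std(axis=1, keepdims=True)
    return (sample - mean) / (std + eps)  # zero mean, unit variance per channel
```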
3 Model Design
3.1 Background: GANs
GANs are a family of deep-learning-based generative models that can be used to learn a data distribution and produce realistic synthetic samples. A typical GAN consists of two feed-forward neural networks: a generator and a discriminator. The generator learns a function that maps a prior vector to a realistic synthetic sample, while the discriminator reads in both real and synthetic samples and learns to distinguish between them.
Well-designed GAN models produce realistic samples. In order to account for classes of data such as earthquakes and noise, label information needs to be incorporated into the GAN. By providing label information y as an input to both the generator and the discriminator, a traditional GAN can be turned into a conditional GAN (Mirza & Osindero, 2014). We develop SeismoGen based on the conditional GAN.
3.2 SeismoGen Model Design
The main structure of SeismoGen, illustrated in Figure 5, consists of two networks: the generator (Figure 5a) and the discriminator (Figure 5b). To increase the quality of the synthesized waveforms from different seismic stations, we train a separate GAN model for each station using the same network structure.

An illustration of the network structure of SeismoGen: generator (top) and discriminator (bottom). The data dimensions and mathematical operations for each layer are listed in each panel. Each long box in the figure represents a layer in our model. For example, “Conv 1 × 128, 16, /4, ReLU, BN” describes a 1D convolutional layer using 16 kernels with kernel size 1 × 128, stride 4, ReLU activation function, and batch normalization.
3.2.1 Generator Structure
We design our generator to comprise three pipelines that synthesize each component of the data individually. All three pipelines share the same input and follow an identical network structure as shown in Figure 5a, but they otherwise do not interact or share trainable parameters such as weights. Each pipeline is a four-layer convolutional network. As shown in Figure 5a, the input vector z is a Gaussian noise vector of length 400, while y is a binary scalar, with y = 1 representing a positive (earthquake) sample. Given z and y, we pass z through a transposed 1D convolution layer to obtain an augmented 1D feature vector of length 1,600. In parallel, the scalar input y is augmented to a vector of length 1,600 and then concatenated with the augmented z vector. Similar to the conventional DCGAN, we use an additional three layers of 1D convolution to synthesize one component of the synthetic seismic sample x̃. The other two components of x̃ are obtained similarly through the two other pipelines.
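A minimal PyTorch sketch of this three-pipeline design is given below. The transposed convolution maps the length-400 noise vector to length 1,600 as described above, but the kernel sizes and channel counts are illustrative placeholders rather than the exact hyper-parameters of Figure 5a.

```python
import torch
import torch.nn as nn

class Pipeline(nn.Module):
    """One generator pipeline; emits a single component of length 1,600."""
    def __init__(self):
        super().__init__()
        # Upsample the length-400 noise vector to a length-1,600 feature.
        self.up = nn.ConvTranspose1d(1, 1, kernel_size=4, stride=4)
        # Three further 1D convolutions synthesize the component.
        self.body = nn.Sequential(
            nn.Conv1d(2, 16, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(16, 16, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(16, 1, kernel_size=9, padding=4),
        )

    def forward(self, z, y):
        h = self.up(z.unsqueeze(1))                   # (B, 1, 1600)
        lab = y.view(-1, 1, 1).expand(-1, 1, 1600)    # label broadcast to length 1,600
        return self.body(torch.cat([h, lab], dim=1))  # (B, 1, 1600)

class Generator(nn.Module):
    """Three pipelines fed the same (z, y) pair but sharing no weights."""
    def __init__(self):
        super().__init__()
        self.pipes = nn.ModuleList(Pipeline() for _ in range(3))

    def forward(self, z, y):
        return torch.cat([p(z, y) for p in self.pipes], dim=1)  # (B, 3, 1600)

# usage: x_fake = Generator()(torch.randn(8, 400), torch.ones(8))
```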
3.2.2 Discriminator Structure
The discriminator is used to evaluate the quality of input samples, that is, whether they are real or synthetic. The discriminator first learns features representative of seismic signals, including both earthquake and non-earthquake events, and then scores samples based on the features learned. The design of our discriminator includes two sequential modules: “feature extraction” and “sample critic.” The feature extraction module learns a feature vector that efficiently characterizes the waveforms. The feature vector is then passed to the sample critic module for evaluation. As a conditional GAN, our discriminator receives two inputs: the sample and the label information. In particular, the sample and label come in as a data pair, either (x, y) for real data or (x̃, y) for synthetic data.
Waveforms from earthquake events and non-earthquake events often contain different characteristic frequency content, and directly incorporating this physical intuition and domain knowledge into our discriminator model can be advantageous. In Z. Zhang et al. (2019), we developed an adaptive filtering technique to automatically separate earthquake from non-earthquake events through a high pass filter, where the cut-off frequency is learned from the raw seismic measurements rather than manually decided. We describe the main idea of our adaptive filtering technique below. A more detailed discussion of this adaptive filtering technique can be found in our recent paper (Z. Zhang et al., 2019).
Given an input sample x of length N, we first transform each channel into the frequency domain using the discrete Fourier transform,

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-i 2 \pi k n / N}, \quad k = 0, 1, \ldots, N - 1,$$

and split the spectrum at a cut-off frequency T:

$$X_{\text{low}}(k) = \begin{cases} X(k), & f_k \leq T, \\ 0, & f_k > T, \end{cases} \qquad X_{\text{high}}(k) = X(k) - X_{\text{low}}(k),$$

where $f_k$ is the frequency associated with bin k. The corresponding filtered low- and high-frequency signals in the time domain, $x_{\text{low}}$ and $x_{\text{high}}$, can then be calculated by the inverse discrete Fourier transform.
The hyper-parameter T plays an important role in separating earthquake events from non-earthquake events. An inappropriate selection of T may confuse the discriminator in learning the feature representations of the earthquake and non-earthquake waveforms. Here, T needs to be pre-determined. Through our tests, we obtain a cut-off frequency of 3.75 Hz for the data sets of interest (Z. Zhang et al., 2019).
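For a fixed cut-off T, this decomposition can be sketched in NumPy as follows. Note that in the adaptive filtering classifier of Z. Zhang et al. (2019) the cut-off is learned rather than fixed; this sketch simply hard-codes the 3.75 Hz value quoted above.

```python
import numpy as np

def split_bands(x, fs=40.0, cutoff=3.75):
    """Split a (3, 1600) sample into low- and high-frequency parts at `cutoff` Hz."""
    X = np.fft.rfft(x, axis=-1)
    freqs = np.fft.rfftfreq(x.shape[-1], d=1.0 / fs)
    mask = np.where(freqs <= cutoff, 1.0, 0.0)   # keep bins at or below the cut-off
    x_low = np.fft.irfft(X * mask, n=x.shape[-1], axis=-1)
    x_high = x - x_low                           # complementary high-frequency part
    return x_low, x_high
```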
We next pass $x_{\text{low}}$ and $x_{\text{high}}$ through two identical pipelines. Each pipeline consists of two convolution layers. As shown in Figure 5b, another input to the discriminator is the binary label y. Similar to the generator, we first augment y to a vector of dimension 1 × 800 with a linear layer. To match the dimension of the feature vector from the sample, we further enlarge the 1 × 800 vector to dimension 32 × 800 with a convolution layer. With three feature vectors learned from $x_{\text{low}}$, $x_{\text{high}}$, and y, we combine them to obtain a feature vector of dimension 96 × 800.
In the sample critic module, the discriminator uses the output vector from the feature extraction module to determine the quality of the input data. Specifically, we design a network of three convolutional layers with stride 3, followed by a mean operator, as illustrated in Figure 5b. The output of the discriminator is a scalar value s, which can be any positive real number, with higher values indicating higher quality (i.e., more realistic and appropriately labeled) input data pairs.
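A minimal PyTorch sketch of this two-module discriminator follows. It reproduces the 32 × 800 and 96 × 800 feature dimensions and the stride-3 critic with a final mean described above; kernel sizes, intermediate channel counts, and activations are our own illustrative assumptions.

```python
import torch
import torch.nn as nn

def split_bands_torch(x, fs=40.0, cutoff=3.75):
    """Torch analogue of the fixed-cutoff band split sketched earlier."""
    X = torch.fft.rfft(x, dim=-1)
    freqs = torch.fft.rfftfreq(x.size(-1), d=1.0 / fs, device=x.device)
    mask = (freqs <= cutoff).to(X.dtype)
    x_low = torch.fft.irfft(X * mask, n=x.size(-1), dim=-1)
    return x_low, x - x_low

def band_pipeline():
    # Feature extraction on one band: (B, 3, 1600) -> (B, 32, 800).
    return nn.Sequential(
        nn.Conv1d(3, 16, kernel_size=9, stride=2, padding=4), nn.LeakyReLU(0.2),
        nn.Conv1d(16, 32, kernel_size=9, stride=1, padding=4), nn.LeakyReLU(0.2),
    )

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.low, self.high = band_pipeline(), band_pipeline()
        self.embed = nn.Linear(1, 800)               # label y -> 1 x 800
        self.lift = nn.Conv1d(1, 32, kernel_size=1)  # 1 x 800 -> 32 x 800
        self.critic = nn.Sequential(                 # 96 x 800 -> coarse score map
            nn.Conv1d(96, 64, kernel_size=9, stride=3, padding=4), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 32, kernel_size=9, stride=3, padding=4), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 1, kernel_size=9, stride=3, padding=4),
        )

    def forward(self, x, y):
        x_low, x_high = split_bands_torch(x)
        lab = self.lift(self.embed(y.view(-1, 1)).unsqueeze(1))
        feats = torch.cat([self.low(x_low), self.high(x_high), lab], dim=1)
        return self.critic(feats).mean(dim=(1, 2))   # mean operator -> one score per sample
```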
3.2.3 Value Function
Following the Wasserstein GAN with gradient penalty of Gulrajani et al. (2017), we train the generator and the discriminator using the value function

$$\min_G \max_D V(D, G) = \mathbb{E}_{(x, y) \sim p_{\text{data}}}\big[D(x, y)\big] - \mathbb{E}_{z}\big[D(G(z, y), y)\big] - \lambda\, \mathbb{E}_{\hat{x}}\Big[\big(\big\|\nabla_{\hat{x}} D(\hat{x}, y)\big\|_2 - 1\big)^2\Big],$$

where G(⋅) represents the generator, D(⋅) represents the discriminator, $p_{\text{data}}$ represents the distribution of real samples, z represents a Gaussian noise vector, and $\hat{x}$ is sampled uniformly along straight lines between pairs of real and synthetic samples. λ is a hyper-parameter weighting the gradient penalty term, which is set to 10 in our experiments according to Gulrajani et al. (2017).
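A sketch of the corresponding critic update, assuming the Generator and Discriminator sketches above (batch handling and optimizer steps omitted):

```python
import torch

def critic_loss(D, G, x_real, y, lam=10.0):
    """WGAN-GP critic loss: score real high, synthetic low, plus gradient penalty."""
    z = torch.randn(x_real.size(0), 400, device=x_real.device)
    x_fake = G(z, y).detach()                        # freeze G during the critic step
    w_loss = D(x_fake, y).mean() - D(x_real, y).mean()
    # Gradient penalty on random interpolates between real and synthetic samples.
    a = torch.rand(x_real.size(0), 1, 1, device=x_real.device)
    x_hat = (a * x_real + (1 - a) * x_fake).requires_grad_(True)
    grad, = torch.autograd.grad(D(x_hat, y).sum(), x_hat, create_graph=True)
    penalty = ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    return w_loss + lam * penalty                    # minimized by the critic
```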
3.2.4 Kernel Size Selection
Kernel size is another important hyper-parameter to select. Larger kernel sizes result in a greater number of network parameters to learn, with significantly increased computational cost, while smaller kernel sizes might miss some large-scale temporal features, leading to suboptimal detection. In our SeismoGen model, we need to choose appropriate kernel sizes for both the generator and the discriminator with consideration of the tradeoff between accuracy and efficiency. Unfortunately, to the best of our knowledge, there is no principled optimization technique for choosing this hyper-parameter. Here, we select the kernel sizes shown in the architecture of Figure 5 based on a trial-and-error strategy that tested various combinations of kernel sizes.
4 Experiment
In this section, we design five tests to validate the performance of our generative model. In Test 1, we first provide a performance comparison of our model versus baseline models via visualization of the synthetic samples. In Test 2, we evaluate the quality of our synthetic samples via a classification task. In Test 3, we study the robustness of our model under training sets of limited size. In Test 4, we apply our generative model on a data augmentation task. Finally, in Test 5 we employ our model on a cross-station classification task and further test its performance using a transfer learning strategy.
4.1 Test 1: Synthetic Earthquake Evaluation via Visualization
In this test, we visually verify the synthetic samples generated by our preferred model and by the baseline models. The visual similarity between synthetic and real waveforms is an important criterion, as traditional earthquake detection and classification techniques hinge on visual appearance. However, visual similarity is not by itself a sufficient metric to judge the quality of our model, and hence we dig deeper in the sections that follow. In this section, we train three generative models on the full data sets from V34A, V35A, and V36A. In particular, there are 6,150, 6,720, and 6,136 real samples on V34A, V35A, and V36A, respectively, with a positive-to-negative ratio of 1:1.
4.1.1 Visual Appearance
Figure 6 shows positive synthetic data generated by our GAN model in filtered form (the raw forms of these samples are shown in Figure S3). The positive synthetic samples share similar characteristics with the real positive samples in Figure 3. While P- and S-wave arrivals are apparent on all three channels, the later arriving S-wave is larger in amplitude, especially on the BHE and BHN channels. Coda waves that extend the wavetrain after the direct arrivals are also visible. We also provide six examples of negative synthetic waveforms in raw form in Figure 7 (the filtered forms of these samples are shown in Figure S4). Compared to the real negative samples shown in Figure 4, these synthesized negative waveforms are highly similar in visual appearance.

Illustration of synthetic, filtered positive sample waveforms generated by our model. We provide two examples of three-component waveforms for each of the three stations: V34A (left column), V35A (center column), and V36A (right column).

Illustration of synthetic, unfiltered negative sample waveforms generated by our model. We provide two examples of three-component waveforms for each of the three stations: V34A (left column), V35A (center column), and V36A (right column).
The similarity between real and synthetic data extends beyond time-domain visualization. The time-frequency spectrograms of the synthetic arrivals in our data set show comparable characteristics to those of real earthquake arrivals, including channel-specific, frequency-dependent seismic energy as well as the prevalence of low-frequency noise sources. We show an example spectrogram comparison for station V34A in Figure 8, with V35A and V36A shown in Figures S1 and S2 in the electronic supplement.

Comparison of real and synthetic waveforms and spectrograms for earthquake arrivals (positive samples) on station V34A. The top row shows an example arrival from the OGS earthquake catalog, while the bottom row shows an example synthetic arrival. The characteristics of the waveforms and the time-frequency spectrograms are comparable in both cases.
4.1.2 Comparison Study
To validate the effectiveness of our generative model, we provide a comprehensive comparison study to baseline models that vary key aspects of our generative model. A detailed discussion of each baseline model and its corresponding results are provided below.
- Baseline 1: Direct Deployment of DCGAN
Most existing GAN-based generative models focus on image synthesis (Radford et al., 2016; Suárez et al., 2017), with comparatively few focusing on generating 1D time series like seismic waveform data. As a first baseline test of our model, we compare it to the deep convolutional generative adversarial network (DCGAN) (Radford et al., 2016) architecture widely used in the literature. Here we adapt a network structure similar to the one in Suárez et al. (2017), which can be seen as a single-pipeline variant of our model. Based on this structure, we provide some synthetic positive and negative samples in Figure S5 in the electronic supplement. For both positive and negative synthetic samples, the DCGAN-generated waveforms of all three components become almost identical, which indicates that directly applying the DCGAN network structure to earthquake detection problems is inappropriate.
- Baseline 2: Independent Generators
It is important to use a shared input for the three pipelines in our generator. To demonstrate this, we design a baseline model by feeding each pipeline with an independent input pair of z and the label feature vector augmented from y. We show the corresponding synthetic positive and negative samples in Figure S6 in the electronic supplement. The synthetic data on all three components maintain some realistic features but are no longer correlated, both in their temporal structure and in the general characteristics of the wavepacket. For instance, the S-wave arrivals in the BHE and BHN components are not aligned in time. This is due to the fact that the three components are generated by three independent generators and no information is shared among them. More synthetic waveform samples generated by Baseline 2 are included in the electronic supplement.
- Baseline 3: Fourier Transform Removed
Signal decomposition can be helpful in producing representative feature vectors in our discriminator. To validate this, we design Baseline 3. In this baseline model, instead of decomposing the temporal signal, we simply duplicate the input signal and feed one copy to each pipeline. We provide the synthetic positive and negative samples using Baseline 3 in Figure S7 in the electronic supplement. Visually, the samples generated by Baseline 3 are better than those of the two aforementioned baseline models. However, there are still some generated samples that are visually inconsistent with real earthquake events like those shown in Figure 3. More synthetic waveform samples generated by Baseline 3 are included in the electronic supplement.
We include the training time cost and the number of parameters of our proposed model and all three baseline models in Table S1 in the supplement.
4.2 Test 2: Synthetic Earthquake Evaluation Via Classification
We quantify performance using classification accuracy,

$$\text{Accuracy} = \frac{TP + TN}{\text{Total}} \times 100\%,$$

where “TP,” “TN,” “FP,” and “FN” refer to the numbers of true positives, true negatives, false positives, and false negatives, respectively, and “Total” refers to the total number of samples in the test set. In this test, we compare the classification accuracy of models trained on real data with that of models trained on synthetic data. In all instances, we use the adaptive filtering classification model due to its demonstrated performance in earthquake detection (Z. Zhang et al., 2019).
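For concreteness, the metrics reported in this and later tests follow directly from these confusion-matrix counts. The sketch below includes precision and recall alongside accuracy, since both are reported in Tests 3–5; it assumes nonzero denominators.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total   # fraction of correct predictions
    precision = tp / (tp + fp)     # fraction of predicted events that are real
    recall = tp / (tp + fn)        # fraction of real events that are detected
    return accuracy, precision, recall
```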
We use data sets from V34A, V35A, and V36A for this test. For each station, we randomly select 4,000 samples as a training set and leave the rest as a testing set. Training set sizes are thus 4,000 samples at all three stations, while testing set sizes are 2,150, 2,720, and 2,136 samples at stations V34A, V35A, and V36A, respectively. The ratio of the positive versus negative samples in both training and test sets is 1:1. With the classification model and the data set selected, we proceed with the test on each of the generative models as in the following four steps:
- Train the generative model with the 4,000-sample training set of real waveforms
- Based on the trained generative model, produce an additional 4,000 synthetic samples, which become the synthetic training set for classification
- Train the adaptive filtering classifier with the synthetic training set
- Test the trained classifier on the test set and report the accuracy
Intuitively, the performance of a classifier trained on real data will be better than that of one trained on synthetic data. Hence, we provide the performance of the classifier trained on real waveform data (denoted as CR) for comparison purposes. Similarly, we denote the classifiers trained on synthetic data as CS, with CS0 being our preferred SeismoGen model and CS1–CS3 corresponding to Baselines 1–3. The higher the classification accuracy of CS, the better the quality of the synthetic samples used to train the classifier. We provide the classification results in Figure 9.

Classification results from a classifier trained on real (CR in Col. 1) and synthetic (CS in Cols. 2–5) training data. Specifically, CS0 is based on the preferred SeismoGen model, and CS1 to CS3 are based on baselines 1–3, respectively. In the figure, darker colors indicate better classification performance.
As expected, CR yields the best performance among all classifiers. The classifier CS0, based on our preferred SeismoGen model, shows stable, high performance across data sets, producing classification accuracy higher than 90.00% on all three stations. Classifiers CS1 and CS3 perform either poorly or unstably, which is consistent with our visual evaluation results reported in Section 4.1.2. Through this test, we verify that our generative model can effectively learn the key features of real seismic time series, so that its synthetic samples may be as helpful as real data for the classification task.
It is interesting to notice, however, that there can be some inconsistency between visual evaluation and classification accuracy. As an example, the results of Baseline 2 shown in Figure S6 can easily be identified as unrealistic samples by human experts. However, when these synthetic samples are used to train a classifier, we obtain accuracies as high as 94.89% and 94.66% on data sets V34A and V36A, respectively. This inconsistency indicates that the classification algorithm we choose here is not sensitive to the unrealistic features in these synthetic samples, possibly because the adaptive filtering classifier favors local features while humans are capable of capturing both local and global features. These unrealistic features might be detected by other classification algorithms and thus cause much worse classification performance.
4.3 Test 3: Robustness of SeismoGen on Limited Data
Our generative model is trained on labeled data sets. Because in practice it may be difficult to obtain a large number of high-quality labels, it is worthwhile to study the robustness of our generative model when the size of the training set is limited. To do this, we design our test to train on data sets with sizes of 10, 20, 40, 60, and 80 samples. We keep the ratio between positive and negative samples at 1:1 in all of these limited training sets.
To better validate the robustness of our generative model, we devise two different sampling strategies. In our first strategy, we start by creating a training set of size 10: we randomly select five positive and five negative real training samples from a seismic station (here, V36A) and combine them as the first training scenario. For testing purposes, we would like to minimize the impact of different samples on predictions and focus on the impact of using different training set sizes. Hence, we build the remaining four training scenarios (sizes of 20, 40, 60, and 80) by adding a certain number of randomly selected positive and negative samples to the previously created training scenario. For example, to create a training set of size 80, we add 10 new positive and 10 new negative samples to the previous training set of size 60. To validate performance, we reserve around 6,000 samples separately as the test set. However, as the size of the training set increases, some low-quality samples may be added, which would unfortunately mislead our SeismoGen and result in degraded synthesized samples. To overcome this issue, we devise a second strategy, in which the training sets are further refined by removing low-quality samples through a visual checking procedure.
With the two groups of five training sets available from the two sampling strategies, we train and obtain five generative models for each strategy, namely, G10, G20, G40, G60, and G80. Using each generative model, we then synthesize a training set of size 4,000 that consists of 2,000 positive and 2,000 negative synthetic samples. Based on these five synthetic training sets, we independently train five adaptive filtering classifiers and evaluate them using our previously reserved test set. We report the corresponding accuracy, precision, and recall of the predictions from the five classifiers for each strategy in Figures 10 and 11. As a benchmark, a classifier trained on a real training set of 4,000 samples is also reported (denoted as “real” in Col. 1 of both Figures 10 and 11).

Accuracy, precision and recall of the robustness test result for station V36A using strategy 1. We provide benchmark metrics from the real data set (Col. 1) as well as results using our generative models based on five different limited training sets (Cols. 2–6). Our model yields reasonable robustness with limited training size.

Accuracy, precision and recall of the robustness test result for station V36A using strategy 2. We provide benchmark metrics from the real data set (Col. 1) as well as results using our generative models based on five different limited training sets (Cols. 2–6). Our model yields reasonable robustness with limited training size.
We observe that the results in Figures 10 and 11 generally follow the same pattern. In particular, the classifier trained with large amounts of real data yields the best performance (Col. 1 in Figures 10 and 11). Using synthetic samples only (Cols. 2–6 in Figures 10 and 11), the classifiers still make promising predictions, with accuracy higher than 85%, exceeding 90% when the training data set size is more than 40. This indicates the robustness of our generative model with respect to limited training set sizes regardless of the sampling strategy, which is further reflected in the increasing trend of both precision and recall.
Besides the similarities between the results in Figures 10 and 11, there are also some differences. In Figure 10, when the size of the training set is increased from 20 to 40, there is a significant drop (more than 4%) in the accuracy of the classifier. In contrast, we do not observe such a significant drop anywhere in Figure 11. This is due to how the training sets are sampled differently for Figures 10 and 11: sampling strategy 2 yields a more consistent prediction performance than strategy 1 because the low-quality samples have been removed. We implement similar robustness tests on the V34A and V35A data sets and report the results in Tables S2 and S3 in the electronic supplement; similar conclusions can be drawn.
In summary, through this test, we learn that our SeismoGen can be effective even when the training set is limited. This is consistent with findings in image synthesis (Gurumurthy et al., 2017; Marchesi, 2017), where GANs have proven effective on limited data sets.
4.4 Test 4: Data Augmentation Using SeismoGen
Data augmentation is a commonly used technique in machine learning to expand the amount of data available for training. It can be valuable for machine learning workflows for earthquake detection tasks due to the difficulty in obtaining a large volume of high-quality, labeled waveform data in certain contexts. However, traditional data augmentation techniques such as cropping, padding, or flipping are limited in their effectiveness because they do little to expand the actual diversity of waveform characteristics necessary to train detection models. Through our previous tests, we demonstrate that our preferred SeismoGen model is capable of synthesizing realistic positive and negative seismic samples. In this test, we utilize those synthetic samples to augment the training set of real waveforms and evaluate classification performance.
We design this test based on the same five generative models (G10, G20, G40, G60, and G80) from strategy 2 of Test 3, due to its consistent performance. The same real training sets of limited sizes (10, 20, 40, 60, and 80) from Test 3 are also used for this test. We use these generative models to produce different numbers of synthetic samples, which are then combined with the existing real training set. For ease of demonstration, we use a variable r to stand for the augmentation ratio of synthetic samples to be added per initial, real sample. We choose six different augmentation scenarios: r1 = 1:1, r10 = 10:1, r50 = 50:1, r100 = 100:1, r200 = 200:1, and r300 = 300:1. Taking G10 and r50 = 50:1 as an example, we begin with a training data set of 10 real samples (five positive and five negative). We then generate 50 × 10 = 500 synthetic samples (250 positive and 250 negative) and combine them with the existing limited real training set of size 10 to give an augmented training set of size 510. We then train an adaptive filtering classifier using this augmented training data set. We report our classification accuracy, precision, and recall on the V36A test set in Figures 12–14, respectively. As a baseline, we also include a zero-augmentation scenario, denoted r0, where the classifier is trained only on the real data set.
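The assembly of an augmented training set can be sketched as follows; the generator(n, label) interface is a hypothetical stand-in for sampling n labeled waveforms from a trained SeismoGen model.

```python
import numpy as np

def augment(real_x, real_y, generator, ratio, seed=0):
    """Mix `ratio` synthetic samples per real sample into the training set.

    E.g., 10 real samples with ratio=50 yield an augmented set of 510.
    """
    rng = np.random.default_rng(seed)
    n_syn = ratio * len(real_x)
    syn_pos = generator(n_syn // 2, label=1)           # synthetic earthquakes
    syn_neg = generator(n_syn - n_syn // 2, label=0)   # synthetic noise
    x = np.concatenate([real_x, syn_pos, syn_neg])
    y = np.concatenate([real_y, np.ones(len(syn_pos)), np.zeros(len(syn_neg))])
    perm = rng.permutation(len(x))                     # shuffle real and synthetic
    return x[perm], y[perm]
```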

Detection accuracy using classifiers trained on augmented training sets from station V36A. In the figure, darker colors indicate better classification performance.

Detection precision using classifiers trained on augmented training sets from station V36A. In the figure, darker colors indicate better classification performance.

Detection recall using classifiers trained on augmented training sets from station V36A. In the figure, darker colors indicate better classification performance.
We observe several interesting outcomes in Figure 12. The baseline (r0) usually yields worse classification accuracy than the augmentation scenarios. From a row-to-row comparison, the performance in each column follows a general pattern of “increase first → peak → decrease later.” This type of pattern has recently been discovered and analyzed elsewhere in the data augmentation literature (Karras et al., 2020; Zhao et al., 2020), and is mainly due to “augmentation leak,” meaning that synthesized data dominate the training set and distort the real data distribution. From a column-to-column comparison, we notice that using more real data to train our SeismoGen leads to an increasing performance trend, with some variance. For instance, the performance reduction from G20 to G40 may be caused by some low-quality real samples. Through this test, we conclude that the synthetic samples generated by our generative model can improve the performance of the classifier through data augmentation.
4.5 Test 5: Cross-Station Detection and Transfer Learning Strategy
For future applications, it is important to understand the performance of SeismoGen in cross-station scenarios, in which the model is trained on a station different from the final target station on which it is tested. To perform an initial test along these lines, we train SeismoGen with a training set from a source station. We then train a classifier with 4,000 synthetic samples produced by the generative model and evaluate the classifier on the test set from a target station. As a comparison, we also implement the same cross-station test with a non-generative approach, meaning that a classifier is trained using 4,000 real samples from a source station before being applied to the same test set from the target station. For the purposes of this test, we select either Station V34A or Station V35A as the source station and Station V36A as the target station. We report the results of cross-station tests based on both synthetic and real samples in Figure 15. As expected, the classifiers trained on real samples yield better performance than those trained on generated synthetic samples. However, classifiers trained on synthetic samples still yield competitive classification results.

Results of cross-station test. Cols. 1 and 2 show the classifiers trained on synthetic samples produced by SeismoGen trained on source station V34A and V35A. Cols. 3 and 4 show the classifiers trained directly on real samples from the same source stations. The three rows show the accuracy, precision, and recall of each classifier on the test set from the target station V36A. In the figure, darker colors indicate better classification performance.
Transfer learning is another domain adaptation technique that has been widely used to improve predictions when data sets are available from different domains (Chai et al., 2020; Tan et al., 2018). To expand on the cross-station tests reported in Figure 15, we further implement a transfer learning strategy and evaluate its performance. Four different types of deep transfer learning methods have been developed in recent years: instance-based, mapping-based, network-based, and adversarial-based deep transfer learning (Pan & Yang, 2010; Tan et al., 2018). Our work belongs to network-based deep transfer learning, in which we pre-train SeismoGen using data from a source domain and fine-tune it with data from the target domain. Similar to the cross-station tests, we select either Station V34A or Station V35A as the source station and Station V36A as the target station.
We report the results of our transfer learning tests in Figure 16. We observe that the transfer learning strategy can effectively improve the performance of models trained on both synthetic and real samples. Generally speaking, an even larger improvement can be observed in the synthetic scenario than in the real scenario. It is worth mentioning that transfer learning may be most effective when some relations (low-level features, weights, etc.) exist between the source and target domains. Moreover, the success of transfer learning relies heavily on the availability of a sufficient number of labels in the target domain.

Results of transfer learning test. On the test set of target station V36A, Cols. 1 and 2 show the accuracy, precision and recall of classifiers pre-trained on the synthetic training set from source stations V34A and V35A. Cols. 3 and 4 show the accuracy, precision and recall of classifiers pre-trained on the real training set from source stations V34A and V35A. In the figure, darker colors indicate better classification performance.
5 Discussion and Future Work
Here, we have demonstrated how a machine learning approach based on the conditional generative adversarial network (CGAN) can be used to generate realistic seismic waveforms that sample either earthquake or non-earthquake classes. A generative model of this type may have multiple use cases in seismology. Our focus is on data augmentation, where we have shown that synthetic waveforms can be used to expand the amount of available training data and thereby improve the classification performance of machine learning algorithms when applied to real data sets. A related use case would be the application of synthetics of this type to test the robustness of detection algorithms. A particularly salient example would be in the field of earthquake early warning, where distinguishing between earthquake and non-earthquake events is of fundamental importance (Meier et al., 2019). Background noise plays a critical role in the performance of earthquake detection methods (Seydoux et al., 2020; Zhu et al., 2019). Labeling and separating negative samples (noisy waveforms) may require a similar amount of effort as acquiring positive samples. This is particularly true when examining smaller earthquakes. On the other hand, mistakenly selected negative samples (e.g., a small earthquake buried in the noise) will confuse the detector and therefore degrade performance. Due to its CGAN-based structure, SeismoGen is capable of generating both high-quality positive and negative samples for augmenting the training set. Furthermore, our generative model can synthesize data with multiple labels. Our focus here is on the two-label scenario: earthquake and noise classes. However, it would be interesting to investigate multi-label synthesis in future work for different applications, for example, categorizing various types of seismic events.
The techniques we outline in this manuscript have limitations that are important to be aware of. Perhaps the most obvious is that the model is constrained by training records from only three stations within a small study region in Oklahoma. Because of this, the model learns what earthquake and non-earthquake waveforms tend to look like at these stations, and is capable of reproducing these basic features in its generative model. As indicated by our cross-station and transfer learning results, it is reasonable to believe that our model may retain some generalization ability on unseen seismic data even without additional training. However, further work is needed to study this issue in more detail before drawing firm conclusions.
While the channel-to-channel temporal correlations in the synthetics are realistic, the model has no understanding of the expected moveout of waveforms across a seismic network, which is crucial in many seismic applications. It is also important to note that natural earthquake data sets are inherently unbalanced, with fewer arrivals from near-source distances and from larger earthquakes. Our generative model reflects this bias, and thus is unlikely to capture the waveform characteristics of the rare occurrences that may be of most interest for some tasks. For these reasons, the model we present here should be viewed as a proof of principle that the methodology is promising, rather than a finished machine learning product ready for widespread deployment.
Moreover, our approach is fundamentally data-driven. We train our model on data, which we posit is sufficient to learn the details of the task at hand. However, real earthquakes, and the seismic waves they broadcast, obey the physical constraints of the governing equations and constitutive laws of dynamic rupture and seismic wave propagation. Incorporating aspects of the known, underlying physical theory in the form of hybrid, physics-informed machine learning models is an active area of research. We hope that in future work, we can improve our generative modeling framework by adopting a more holistic, physics-informed approach. The application of a conditional GAN in this framework may be particularly powerful, as it could allow one to condition on features like magnitude and source-site distance, both of which play a fundamental role in the recorded amplitudes of strong ground motion.
6 Conclusions
We develop a generative model that can produce realistic, synthetic seismic waveforms of either earthquake or non-earthquake (noise) classes. Our machine learning model is in essence a conditional GAN designed to operate on three-component waveforms at a single seismic station. To verify the efficacy of our generative model, we apply it to seismic field data collected in Oklahoma. Through a sequence of qualitative and quantitative tests and benchmarks, we show that our model can generate high-quality synthetic waveforms. We further demonstrate that the performance of machine learning-based detection algorithms can be improved by using training sets augmented with both synthetic and real samples. Our generative model has several potential use cases across seismology, but our focus here is on the earthquake detection problem.
Acknowledgments
The authors declare no conflicts of interest. This work was funded by the Center for Space and Earth Science (CSES) at Los Alamos National Laboratory (LANL) and the Laboratory Directed Research and Development program of LANL under project number 20210542MFR. DTT acknowledges institutional support from The University of Texas at Austin's Rising STARS program. The experiments were performed using supercomputers of LANL's Institutional Computing Program. We would also like to acknowledge Dr. Jake Walter of the Oklahoma Geological Survey and the University of Oklahoma for providing a revised catalog. We thank the USGS for providing the raw seismic data from the Transportable Array (network code TA).
Open Research
Data Availability Statement
The seismic data sets collected at Stations V34A, V35A, and V36A used here can be downloaded from the openly accessible Data Management Center managed by IRIS (http://ds.iris.edu/ds/nodes/dmc/). All the training sets used for our model can be downloaded from the GitLab repository (https://gitlab.com/huss8899/seismogramgen). All the results generated using our generative model, as well as those from all the baseline models, have also been shared in the repository for interested readers.