Machine Learning for Source Identification of Dust on the Chinese Loess Plateau

The provenance of the voluminous eolian dust on the Chinese Loess Plateau (CLP) is still highly debated. Here we train two machine learning models, a support vector machine and a convolutional neural network, on the element compositions of surface sediments from eight potential source regions, and then determine the dust sources and their contributions by classifying the last glacial loess and present interglacial sediments on the CLP. The trained models succeed in differentiating the major secondary sources and in quantitatively estimating the contributions of both primary and secondary sources, at least during the last glacial‐interglacial cycle. The inferred constancy of the dust source despite changing climate conditions agrees with results derived from Sr‐Nd isotopes and U‐Pb age spectra. Our observations demonstrate that big geochemical data sets coupled with machine learning technology are fully capable of tracing sources.


Introduction
The alternating deposition of loess and paleosols during glacial-interglacial cycles on the Chinese Loess Plateau (CLP) in Central China serves as one of the most complete archives of regional climate change (An, 2000; An et al., 2001; Guo et al., 2002). Constraining the dust provenance is thus of profound paleoclimatic significance in Central Asia (Licht, Dupont-Nivet, et al., 2016; Liu & Ding, 1998). Despite its importance, the dust provenance remains highly debated and is further complicated because the dust sedimentation has been demonstrated to be episodic, not continuous, which relates to the spatial-temporal variability (Kapp et al., 2015) at both tectonic and glacial-interglacial timescales (Sun et al., 2020).
After years of study, a broad consensus has been reached that the NTP and the Central Asian Orogenic Belt (CAOB) are the two primary sources (Sun et al., 2020). Nevertheless, major secondary sources such as the Gobi Desert, the Qaidam Basin, and river sediments cannot be clearly discriminated. Meanwhile, the processes that control aridification in Central Asia and the concurrent sedimentation of eolian dust operate at a continental scale (Guo et al., 2002; Licht, Dupont-Nivet, et al., 2016). To our knowledge, however, there is no spatial assessment at this scale of the compositions of dust on the CLP and from the potential source regions (PSRs), or of the relation between them. Even if such an assessment existed, our empirical knowledge might be inadequate to cope with the resulting big geodata. In this sense, our understanding of the provenance and the quantitative estimation of the source contributions (e.g., Jia et al., 2019) is still incomplete. Consequently, large-scale spatial geochemical assessment coupled with big-data approaches is urgently needed.
Machine learning, which allows computers to handle new situations via analysis, self-training, observation, and experience, has been successfully applied in the geosciences (Bergen et al., 2019). Zhao et al. (2019) use random forest and a deep neural network to predict the origin of Cenozoic basalts in Northeast China. Mousavi and Beroza (2020) design a regressor composed of convolutional and recurrent neural networks to estimate earthquake magnitude. Using an artificial neural network, Withers et al. (2020) successfully build a ground motion model. In this study, we apply two supervised machine learning algorithms, support vector machine (SVM) and convolutional neural network (CNN), to determine the dust sources by classifying the last glacial loess (LGL) and present interglacial sediments (PIS) on the CLP. To build the classification models, a continental-scale geochemical mapping data set covering the PSRs (Figure 1, see section 2) is used as training data.
The features learned by the models indicate that the PSRs are geochemically separable, which means that, even though they are fed by similar primary sources, the secondary sources have their own unique geochemical signatures. The trained models are thus capable of predicting which PSRs the LGL and PIS on the CLP resemble the most. Principal component analysis (PCA) is used to aid in understanding the probable dominant geochemical features of the PSRs extracted by SVM and CNN. Based on the classification, we identify the actual secondary sources and estimate the contributions of both primary and secondary sources to the dust deposition. Our study also demonstrates the feasibility of applying CNN to big geodata processing as well as the promise of using element geochemistry and big-data approaches (e.g., machine learning technology) to solve problems in land surface and solid-earth geoscience.
[Figure 1. Caption only partly recovered: the surviving text cites Pullen et al. (2011) and notes that the circled H marks high pressure over Central Asia and that DT R. and HS R. are the Datong River and Huangshui River, respectively.]

Data and Methods
The sampling of the PSRs (Figure 1) is deployed at the continental scale, that is, to document the abundance of chemical elements in materials occurring at the Earth's surface, often with a density and coverage larger than 1 sample/1,000 km² and 0.5 million km², respectively (Darnley et al., 2005). Concentrations of 67 elements, 11 oxides, H₂O⁺, total carbon, and organic carbon are determined (Supporting Information Text S1) and used in the machine learning modeling. The data are first screened to account for missing values (unreported, often appearing as blank or null) and censored values (greater or less than the detection limits).
Considering that geochemical data are compositional (closed) data, centered log-ratio (clr) transformation (Aitchison, 1986) is used to resolve the closure problem. Zero-mean normalization is applied next (Text S2). Supervised machine learning algorithms can "learn" to recognize a pattern and build models from known examples (training set), test and optimize the trained models via labeled validation set, and ultimately make predictions for previously unseen data (Bergen et al., 2019). SVM (Cortes & Vapnik, 1995) and CNN (LeCun et al., 1998) are used in this study to build the classifying models.
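The preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy composition are hypothetical, and "zero-mean normalization" is assumed here to mean column-wise standardization to zero mean and unit variance (the exact recipe is in Text S2, which is not reproduced here).

```python
import numpy as np

def clr_transform(X):
    """Centered log-ratio: log of each part minus the row-wise mean of the logs."""
    X = np.asarray(X, dtype=float)
    if np.any(X <= 0):
        # Zeros and censored values must be imputed before taking logarithms.
        raise ValueError("clr requires strictly positive parts")
    log_x = np.log(X)
    return log_x - log_x.mean(axis=1, keepdims=True)

def zero_mean_normalize(Z):
    """Column-wise standardization (assumed meaning of zero-mean normalization)."""
    Z = np.asarray(Z, dtype=float)
    std = Z.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (Z - Z.mean(axis=0)) / std

# Hypothetical toy data: 3 samples, 4 compositional parts (not from the paper).
X = np.array([[10.0, 20.0, 30.0, 40.0],
              [25.0, 25.0, 25.0, 25.0],
              [5.0, 15.0, 60.0, 20.0]])
Z = clr_transform(X)
Z_std = zero_mean_normalize(Z)
```

Each clr-transformed row sums to zero, which removes the constant-sum constraint that makes raw compositional data "closed".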
The SVM method is applicable to both binary and multiclass classification: it defines the optimal boundary or boundaries between classes using either a linear kernel or a nonlinear kernel such as the radial basis function (Chang & Lin, 2011; Cracknell & Reading, 2013; Shahnas et al., 2018). A major advantage of SVM is its effectiveness in high-dimensional spaces. In this work, the hyperparameters of the SVM, including the kernel function and the penalty that guards against overfitting, are tuned by grid search with tenfold cross-validation to find the best estimator (Text S3). Balanced class weights are used to address the imbalance in sample numbers among the classes.
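A grid search of this kind might look like the sketch below, written with scikit-learn. The library choice, the synthetic data, the parameter grid, and the macro-F1 scoring are all assumptions for illustration; the source only states that kernel and penalty are tuned by grid search with tenfold cross-validation and that balanced class weights are used.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the geochemical training set (hypothetical).
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, weights=[0.6, 0.3, 0.1], random_state=0)

# Hypothetical search grid over kernel type, penalty C, and RBF coefficient gamma.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]},
]

# Tenfold stratified cross-validation keeps class proportions in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# class_weight="balanced" reweights classes inversely to their frequencies.
search = GridSearchCV(SVC(class_weight="balanced"), param_grid,
                      cv=cv, scoring="f1_macro")
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

After fitting, `search.best_estimator_` holds the model refit on the full training set with the selected hyperparameters.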
The CNN is a popular class of deep neural networks (LeCun, Bengio, & Hinton, 2015; LeCun, Bottou, et al., 2015) and has been widely applied in computer vision and natural language processing (Krizhevsky et al., 2012; LeCun et al., 1998). The CNN structure is composed of convolutional layers for feature extraction, pooling layers for feature compression, and fully connected layers for classification (Bergen et al., 2019). The network is trained using backward propagation of errors with stochastic gradient descent (LeCun et al., 1998). In this research, the cross-entropy loss (Shore & Johnson, 1980) and the adaptive moment estimation (Adam) optimizer (Kingma & Ba, 2014) are used to update the weights by minimizing the mean loss. A large number of parameters, such as the number of convolutional layers (i.e., the network structure), the learning rate, and the dropout probability, are varied to find the best estimator and to prevent overfitting (Text S4). Figure 2 shows the CNN structure used in this contribution.
The performance of the two models is evaluated by four metrics: accuracy, precision, recall, and F1 score. Accuracy is the ratio of correct predictions to all predictions. Precision is the ratio of true positives to the sum of true and false positives in one class. Recall is the ratio of true positives to the sum of true positives and false negatives, reflecting the classifier's completeness. The F1 score, defined as (2 × precision × recall)/(precision + recall), balances precision and recall and is usually regarded as the most effective metric.
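As a small worked example of these definitions, the sketch below computes the metrics for one class from a handful of labels. The function names and the labels are hypothetical and serve only to make the ratios concrete.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical labels for six samples (not data from the paper).
y_true = ["Hetao", "Hetao", "Hetao", "ETP", "ETP", "Qaidam"]
y_pred = ["Hetao", "Hetao", "ETP", "ETP", "Hetao", "Qaidam"]

acc = accuracy(y_true, y_pred)
precision, recall, f1 = per_class_metrics(y_true, y_pred, "Hetao")
```

For the class "Hetao" there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3, while four of the six predictions are correct overall.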
PCA, which highlights inter-group differences and within-group similarities in a data set, is a powerful method for process discovery and pattern identification. It finds new linear combinations of the variables (elements) based on measures of association (correlation). A compositional biplot is usually used to display the PC loadings of each element (Aitchison & Greenacre, 2002), that is, their contributions to the PCs. The larger the absolute value of a loading, the greater the contribution. PCA is used in the present study only for extracting geochemical features and illustrating variations of element associations of the PSRs to aid in understanding the machine learning modeling.
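The loadings and scores plotted in such a biplot can be obtained from a singular value decomposition, as in the sketch below. It assumes PCA on the covariance of clr-transformed data and uses random toy compositions; the function name, the data, and the covariance (rather than correlation) choice are illustrative assumptions, not the authors' procedure.

```python
import numpy as np

def pca_loadings(Z, n_components=2):
    """PCA via SVD of the column-centered matrix.

    Returns sample scores, variable loadings, and explained-variance ratios.
    """
    Zc = Z - Z.mean(axis=0)
    U, s, Vt = np.linalg.svd(Zc, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T  # one row per variable, one column per PC
    return scores, loadings, explained[:n_components]

# Hypothetical strictly positive compositions: 50 samples, 5 parts.
rng = np.random.default_rng(0)
X = rng.lognormal(size=(50, 5))

# clr transform before PCA, as is usual for compositional biplots.
log_x = np.log(X)
Z = log_x - log_x.mean(axis=1, keepdims=True)

scores, loadings, explained = pca_loadings(Z)
```

Variables with large absolute loadings on a component dominate that axis of the biplot, which is how element groups are read off the plots.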

Machine Learning Modeling
The input data are the chemical compositions (67 elements, 11 oxides, H₂O⁺, total carbon, and organic carbon) of the LGL and PIS on the CLP. The target output is one of the eight PSRs. Both machine learning algorithms perform well. For the SVM model, a radial basis function (rbf) kernel, a kernel coefficient (γ) of 0.01, and a penalty (C) of 26 are used (Text S3), and the resulting accuracy, precision, recall, and F1 score are 0.97, 0.98, 0.97, and 0.97, respectively (Table S1). The CNN model is composed of three convolutional layers with 32, 64, and 32 neurons; that is, a neural network of "Input (81)-32-64-32-Output (8)" is built (Figure 2 and Text S4), and the four metrics are all 0.94 (Table S2). The good performance of the two models suggests that the differentiation of the major secondary PSRs is successful, effective, and reliable. This means that, even though they are fed by similar primary sources (Sun et al., 2020), these secondary sources have their own unique geochemical signatures that are identified by the two algorithms, although which extracted features played a role is unclear because of the black-box nature of the methods.
Nonetheless, we speculate that the two models may have recognized valid geochemical patterns related to regional differences among the PSRs, as shown by the PCA biplots. In Figure 3a, the LGL and PIS are well characterized not only in the direction of the first principal component (PC1), with large positive loadings on rare earth elements (REEs) and high field strength elements (HFSEs), but also in the direction of the second principal component (PC2), with large negative loadings on large ion lithophile elements (LILEs) as well as total carbon. A small part of the LGL and PIS is characterized by negative PC1 and positive PC2, showing mafic signatures (e.g., Cr, Co, Ni, MgO, and Pt).
Comparison with Figures 3b and 3c shows that the Qaidam Basin, Tarim Basin, ETP, and Hetao Graben (termed NTP-related secondary sources) share a more similar distribution on the PC biplot with the LGL and PIS than the Junggar Basin, N. Alxa, S. Alxa, and northeastern sandy deserts do. However, Figure 3d shows that only the ETP and a small part of N. Alxa and S. Alxa fall into the CLP domain with negative PC3 (e.g., MgO and CaO) as well as negative PC6 loadings (e.g., organic carbon and N). Meanwhile, even though they are fed by similar sources (i.e., the CAOB; Zhang et al., 2016), the Junggar Basin and N. Alxa can be distinguished by the felsic signature demonstrated by positive PC1 loadings (Figure 3c). Similarly, SVM and CNN can discover valid geochemical patterns of the PSRs from the 81 compositional variables used.
Figure 4 shows the classification results of the two models. Two distinct similarities can be observed: (1) the two algorithms deliver generally consistent classification of both LGL and PIS into the PSRs, with clear dominance of the Hetao Graben, Qaidam Basin, and ETP (Figures 4a-4d), and (2) the classification patterns for LGL and PIS resemble each other (Figures 4a-4d). Using the accuracy of each model as its weight, weighted average proportions of each PSR in LGL and PIS are obtained (Figures 4e and 4f). For both LGL and PIS, almost half is grouped into the Hetao Graben, and around 15% and 15-25% into the Qaidam Basin and ETP, respectively. The remainder is roughly equally (about 5-6% each) categorized into the Junggar Basin, Tarim Basin, and S. Alxa Plateau. The N. Alxa Plateau accounts for less than 1%. No LGL or PIS samples are grouped into the northeastern sandy deserts. More importantly, from LGL to PIS, only slight variation (<10%) within the same PSR is detected (Figures 4e and 4f). In this regard, the classification of LGL and PIS exhibits an overall consistency.
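The accuracy-weighted averaging of the two models' class proportions can be sketched as below. The accuracies 0.97 (SVM) and 0.94 (CNN) are those reported in the text; the per-model proportions, the "Other" bucket, and the function name are hypothetical placeholders, since the per-model values appear only in Figure 4.

```python
def weighted_proportions(props_by_model, accuracy_by_model):
    """Average each region's proportion across models, weighted by model accuracy."""
    total_weight = sum(accuracy_by_model[m] for m in props_by_model)
    regions = sorted({r for props in props_by_model.values() for r in props})
    return {
        r: sum(accuracy_by_model[m] * props_by_model[m].get(r, 0.0)
               for m in props_by_model) / total_weight
        for r in regions
    }

# Model accuracies as reported in the text.
accuracy_by_model = {"SVM": 0.97, "CNN": 0.94}

# Hypothetical per-model class proportions (illustrative only).
props_by_model = {
    "SVM": {"Hetao Graben": 0.50, "ETP": 0.20, "Qaidam Basin": 0.15, "Other": 0.15},
    "CNN": {"Hetao Graben": 0.46, "ETP": 0.24, "Qaidam Basin": 0.14, "Other": 0.16},
}

avg = weighted_proportions(props_by_model, accuracy_by_model)
```

Because the weights are normalized by their sum, the averaged proportions still add up to one whenever each model's proportions do.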

Classification of the Dust on the CLP
We also verify the spatial agreement of the classification of LGL or PIS at the same sites by the two models. It turns out that 65% of all the dust samples have identical predicted outputs in both models, illustrating a relatively high spatial consistency. Although the remaining 35% are not identical in the two models, confusion mostly occurs between the Hetao Graben and other PSRs such as the Qaidam Basin and ETP. The 35% discordant classification shows no clear spatial pattern and might be due to three reasons: (1) the common source region of some PSRs, for example, the northeastern Tibetan Plateau for the Qaidam Basin, ETP, and Hetao Graben (Li et al., 2009; Zhang et al., 2016); (2) dust homogenization or mixing on the CLP (Kapp et al., 2015; Li et al., 2018); and (3) slightly different importance assigned to elements by the two models during classification. However, this could also indicate that the Hetao Graben is probably a storage area for materials from other PSRs (Nie et al., 2015), considering its geographic location.
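A check of this kind amounts to comparing the two models' predictions sample by sample and tallying which pairs of regions are confused. The sketch below uses hypothetical labels and a hypothetical function name; it illustrates the bookkeeping only and does not reproduce the 65% figure.

```python
from collections import Counter

def agreement(pred_a, pred_b):
    """Fraction of samples with identical predictions, plus a tally of discordant pairs."""
    if len(pred_a) != len(pred_b) or not pred_a:
        raise ValueError("prediction lists must be non-empty and of equal length")
    pairs = list(zip(pred_a, pred_b))
    same = sum(a == b for a, b in pairs)
    # Sort each discordant pair so (A, B) and (B, A) are counted together.
    discordant = Counter(tuple(sorted(p)) for p in pairs if p[0] != p[1])
    return same / len(pairs), discordant

# Hypothetical predictions for five dust samples (not data from the paper).
svm_pred = ["Hetao", "Hetao", "ETP", "Qaidam", "Hetao"]
cnn_pred = ["Hetao", "ETP", "ETP", "Hetao", "Hetao"]

frac_same, discordant = agreement(svm_pred, cnn_pred)
```

In this toy case three of five samples agree, and both disagreements involve the Hetao class, mirroring the kind of confusion described in the text.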

Dust Provenance
Our results show that the majority (approximately 80%) of LGL and PIS bear resemblance to the surface sediments from the Hetao Graben, ETP, and Qaidam Basin. Besides our results, zircon U-Pb age spectra also emphasize the significance of the ETP and Qaidam Basin (Pullen et al., 2011) as well as of recycled fluvial deposits from the Hetao Graben (Nie et al., 2015) in the eolian sedimentary budget on the CLP. Therefore, based on the machine learning modeling, we propose that reworked fluvial sediments from the Hetao Graben contribute the most (approximately 50%) to the dust sediment on the CLP during the last glacial-present interglacial cycle, followed by the ETP (14-24%), Qaidam Basin (14-16%), Tarim Basin, Junggar Basin, and S. Alxa (5-7%), and N. Alxa (<1%).
Nevertheless, the Hetao Graben, whose materials are derived largely from the northeastern Tibetan Plateau (Lin et al., 2001; Nie et al., 2015) and adjacent highlands, still serves as a "transitional storage." This is revealed by the confusion between the Hetao Graben and the rest of the PSRs during the classification. Consequently, we apply the two machine learning models described above to identify the provenance of the Hetao Graben. The two models perform well, and the four metrics are all 0.93 for SVM and 0.92 for CNN (Tables S3 and S4). The corresponding weighted average proportions are then obtained (Figures 4g and 4h). Clearly, the S. Alxa, N. Alxa, and ETP contribute greatly to the Hetao Graben materials, which agrees with former studies. Given that the S. Alxa is composed mainly of the Qilian piedmonts on the NTP (Zhang et al., 2016), we combine the S. Alxa, ETP, Qaidam Basin, and Tarim Basin as the NTP-sourced PSR, which explains more than 65% of the contribution to the Hetao Graben (Figures 4g and 4h). Some tributaries of the Yellow River draining the Qilian Mountains, such as the Datong River and Huangshui River (Figure 1), could account for the similarity between the S. Alxa and Hetao Graben. Contributions from the Junggar and Tarim Basins likely result from the westerly and northwesterly winds and dust storms over the Hexi Corridor (Roe, 2009). It has been suggested that element ratios and compositions are not reliable geochemical tracers (Chen & Li, 2011; Sun et al., 2020) because changes in element compositions by either fluvial or eolian processes would complicate source attribution. However, the decomposition of the Hetao Graben demonstrates that big geochemical data sets coupled with machine learning technology are fully capable of tracing sources.
In accordance with the results of the two-stage machine learning modeling, we propose that the eolian dust sedimentary budget on the CLP is dominated by NTP-sourced materials (approximately 75-80%), with a minor share of CAOB-sourced materials (approximately 20-25%), at least during the last glacial-present interglacial cycle, indicating a generally constant CLP dust source at this timescale. Our results are consistent with studies of zircon U-Pb age spectra and whole rock geochemistry that also support the NTP as the dominant dust source (Pullen et al., 2011; Zhang et al., 2016). Sun et al. (2020) reach a similar conclusion, that the NTP and the CAOB are the two primary sources of the Quaternary loess-paleosol deposits on the CLP, by thoroughly reviewing Sr-Nd isotopes, quartz ESR-CI-δ¹⁸O data, and zircon age spectra. However, this study presents the first quantitative estimation of the contributions of both primary and secondary sources based on big geochemical data sets and accurate machine learning classification.
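The arithmetic behind such a two-stage budget can be illustrated with the sketch below. All numbers are assumptions chosen inside the ranges quoted in the text (for example, 19% for the ETP within 14-24%), the NTP fraction of the Hetao Graben is set to the quoted lower bound of 65%, and the remainder of the Hetao Graben is assumed to be CAOB-sourced. The function name is hypothetical and this is not the authors' calculation.

```python
def primary_budget(direct, hetao_share, hetao_ntp_fraction,
                   ntp_regions, caob_regions):
    """Fold the Hetao Graben share into the two primary sources (values in percent)."""
    ntp = sum(direct[r] for r in ntp_regions) + hetao_share * hetao_ntp_fraction
    caob = (sum(direct[r] for r in caob_regions)
            + hetao_share * (1.0 - hetao_ntp_fraction))
    return ntp, caob

# Illustrative first-stage shares in percent (assumed values within the quoted ranges).
direct = {"ETP": 19.0, "Qaidam Basin": 15.0, "Tarim Basin": 5.0,
          "S. Alxa": 5.0, "Junggar Basin": 5.0, "N. Alxa": 1.0}
hetao_share = 50.0          # approximately half of the dust, per the text
hetao_ntp_fraction = 0.65   # "more than 65%" NTP-sourced; lower bound assumed

# Grouping follows the text: S. Alxa, ETP, Qaidam, Tarim as NTP-sourced;
# Junggar Basin and N. Alxa as CAOB-fed.
ntp, caob = primary_budget(
    direct, hetao_share, hetao_ntp_fraction,
    ntp_regions=["ETP", "Qaidam Basin", "Tarim Basin", "S. Alxa"],
    caob_regions=["Junggar Basin", "N. Alxa"],
)
```

With these assumed inputs the NTP share comes to about 76.5% and the CAOB share to about 23.5%, which falls inside the 75-80% and 20-25% ranges stated in the text.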
As for the glacial-interglacial consistency of the classification, which depicts a constant dust source, Bird et al. (2020) use Sr-Nd-Hf isotopes to show that a major, established dust source on the Tibetan Plateau has been active and unchanged since the late Miocene, despite dramatically changing climate conditions. Our results further indicate that the contributions of the major secondary sources remain constant at this timescale. The barely changed main dust-transporting winds controlled by the East Asian Winter Monsoon (Figure 1) play an important role in the deposition of the Quaternary loess (Sun et al., 2020) and may explain the constant source structure of the Hetao Graben, Junggar Basin, and Alxa Plateau to the northwest of the CLP. During glacials, the influence of enhanced cooling over the North Atlantic extends notably far eastward via the westerlies (Vandenberghe et al., 2006); the jet stream is largely restricted to the south of the Tibetan Plateau (Pullen et al., 2011), leading to strong penetration of the westerlies into Central Asia (Figure 1), which is responsible for the dust transportation from the NTP to the CLP. During interglacials, when this westerly erosive pathway is inactive, evidence of NTP-sourced materials is likely due to extensive eolian cannibalism on the CLP (erosion of older deposits; Kapp et al., 2015; Nie & Peng, 2014). Changing eolian sources at both tectonic and glacial-interglacial timescales have been reported by previous studies (e.g., Sun et al., 2020; Sun & Zhu, 2010; Xiao et al., 2012). However, this work focuses on the late Pleistocene, during which no major tectonic changes occurred. In addition, we have to acknowledge that the reported glacial-interglacial provenance fluctuations were obtained from different grain-size fractions (Sun et al., 2020). Therefore, more work is needed to determine the onset of this constant source structure and its causes.
Additionally, our results do not rule out possible contributions from other areas such as the Ordos Plateau, which lies downwind of the Alxa Plateau and Hetao Graben (Nie et al., 2018). It is not included in the present study because, given its small areal extent and the sampling density used, it does not yield enough samples for the machine learning models.

Conclusions
We investigate the sources of the dust on the CLP by applying two supervised machine learning algorithms. SVM and CNN models are trained on the compositions (81 variables) of 737 surface sediments from eight PSRs. The trained models are used to determine the dust sources by classifying the LGL and PIS into the PSRs. The classifications of the two models are found to be consistent. We show that the sedimentary budget on the CLP is dominated (approximately 50%) by reworked fluvial sediments from the Hetao Graben during the last glacial-present interglacial cycle, followed by the ETP (14-24%), Qaidam Basin (14-16%), Tarim Basin, Junggar Basin, and S. Alxa (5-7%), and N. Alxa (<1%). After decomposing the Hetao Graben, the two primary sources, that is, the NTP-sourced and CAOB-sourced materials, contribute 75-80% and 20-25% to the dust sedimentation, respectively. Moreover, a constant dust source despite changing atmospheric circulation is identified. Our results agree with those derived from Sr-Nd isotopes, quartz ESR-CI-δ¹⁸O data, and zircon age spectra, and we quantitatively estimate the contributions of both primary and secondary sources. This study demonstrates that machine learning is an ideal approach to processing high-dimensional geochemical data and would aid our understanding of Earth system processes.