Relativistic Electron Flux Prediction at Geosynchronous Orbit Based on the Neural Network and the Quantile Regression Method

Geosynchronous satellites are exposed to relativistic electrons, which may cause irreparable damage to the satellites. The prediction of the relativistic electron flux is therefore important for the safety of the satellites. Unlike previous works focusing on the single-value prediction of relativistic electron flux, we predict the relativistic electron flux in a probabilistic approach by using a neural network and the quantile regression method. In this study, a feedforward neural network is first designed to predict the average daily flux of relativistic electrons (>2 MeV), or the expectation of the flux from the probabilistic perspective, at geosynchronous orbit 1 day in advance. The neural network performs well, with the average root mean squared error, the average prediction efficiency, and the average linear correlation coefficient between observations and predictions reaching 0.305, 0.832, and 0.916, respectively, during the period 2011-2017. We then combine the quantile regression method with the feedforward neural network to predict the quantiles of relativistic electron flux by applying a special loss function to the neural network. We use the multiple-quantiles prediction model to predict flux ranges of the relativistic electrons and the corresponding probabilities, which is an advantage over the single-value prediction. Moreover, to our knowledge, this is the first time that the approximate shape of the probability density function of relativistic electron flux has been predicted.


Introduction
Geosynchronous orbit (GEO), which is located close to the outer radiation belt, is a critical region since hundreds of satellites for telecommunication, navigation, and other applications operate there. Nonetheless, these satellites are exposed to relativistic electrons with energies reaching MeV, which can cause irreparable damage to the satellites (Horne et al., 2013; Lai et al., 2018; Pilipenko et al., 2006; Wrenn et al., 2002). Therefore, predicting the flux of relativistic electrons at GEO is essential for the safety of the satellites.
In the past decades, great progress has been made in understanding the acceleration mechanism of relativistic electrons and in developing techniques to predict the relativistic electron flux at GEO. There are mainly two kinds of theories that explain the electron acceleration mechanism, namely, radial transport (Li et al., 2001) and local acceleration (Horne & Thorne, 1998; Roth et al., 1999; Summers et al., 1998; Summers & Ma, 2000). Based on these two theories, some physical models have been proposed to make predictions. Li et al. (2001) and Li (2004) used solar wind parameters, Dst, and historical flux values of MeV electrons to predict the average daily flux of MeV electrons 1-2 days in advance, based on the standard radial diffusion equation. The prediction efficiencies of their works reached around 0.6 and 0.72, respectively. Turner and Li (2008) used the flux of low- and high-energy electrons to predict the average daily flux of high-energy electrons several days in advance, based on the strong correlation between the fluxes of low-energy and high-energy electrons. A combination of this model and the radial transport model produced an average prediction efficiency (PE) of 0.76 for the Years 1995-2006. Meanwhile, Lyatsky and Khazanov (2008) developed a similar model focusing on the sources and losses of relativistic electrons at GEO to make predictions 1 day in advance by using solar wind data as the only input parameters.
In addition to the physics-based models, there are also many empirical and data-based models, such as the linear prediction filters (Chen et al., 2016; Tsutai et al., 1999), the support vector machine (Wang & Shi, 2012), the multivariate autoregressive model plus Kalman filter (Sakaguchi et al., 2013), the nonlinear autoregressive moving average with exogenous inputs model (Boynton et al., 2013, 2015), the empirical mode decomposition model (Qian et al., 2019), and artificial neural networks (Fukata et al., 2002; Guo et al., 2013; Koon & Gorney, 1991; Ling et al., 2010; Shin et al., 2016; Wei et al., 2018). Most of these models use solar wind parameters, geomagnetic indices, and historical flux values of relativistic electrons or low-energy electrons to predict relativistic electron flux. Among these models, artificial neural networks are particularly important, as they have been adopted by many works dating back to the early 1990s and continuing to the present. Artificial neural networks are powerful computational data-based models that can capture and represent complex input/output correlations, and the techniques used to build and train them have improved over time. Koon and Gorney (1991) used a one-hidden-layer feedforward neural network (FFN) to predict relativistic electron flux with the Kp index as inputs for the first time. Fukata et al. (2002) used an Elman neural network, which differed from the feedforward neural network in that it had feedback connections from the hidden layer to the input layer; the inputs of the network were only the AL and Dst indices. Ling et al. (2010) developed a one-hidden-layer FFN to predict relativistic electron flux at GEO using historical flux values of relativistic electrons and the Kp index as inputs. It produced a mean PE of 0.71 for 6-month test intervals. Guo et al. (2013) adopted a radial basis function neural network to predict >2 MeV electron flux at GEO, which used historical values of flux, solar wind velocity, and the ap index as inputs and produced PEs of 0.766, 0.808, and 0.882 for the Years 2008, 2009, and 2010, respectively. Shin et al. (2016) adopted a FFN with a more complex structure, as the network had two hidden layers; it predicted geosynchronous electron flux over a wide energy range and at hourly resolution. More recently, Wei et al. (2018) used historical flux values of relativistic electrons and two parameters to predict relativistic electron flux at GEO. The neural network of their work is the long short-term memory network, which is well suited to time series problems.
All of the abovementioned works perform so-called single-value prediction: they predict only a single value of the electron flux, without any reliable assessment of uncertainties. One of the main themes of future prediction work is expected to be shifting the forecasting paradigm to a probabilistic approach focused on the reliable assessment of uncertainties (Camporeale, 2019). Miyoshi and Kataoka's (2008) work on the probabilistic forecast of relativistic electron flux enhancement at GEO appears to be the only model that produces a probabilistic forecast. However, it is based on statistical results under the special condition of stream interface events, which may limit its application.
In this study, we first describe a FFN model designed to make single-value predictions of relativistic electron flux at GEO 1 day in advance. Then, we combine the quantile regression method with the FFN to predict the quantiles of relativistic electron flux. Moreover, by making multiple-quantiles predictions, we show other capabilities of the combined model: it can predict flux ranges of the relativistic electrons and the corresponding probabilities, which is an advantage over the single-value prediction. To our knowledge, this is also the first time the probability density function of the relativistic electron flux has been predicted. The data and codes of this study are available online (https://github.com/studyalltheway/Relativistic-Electron-Flux-Prediction).

Data
In this study, six solar wind parameters, four geomagnetic indices, and fluxes of medium- and high-energy (tens of keV to MeV) electrons are used to predict relativistic electron flux. The solar wind parameters include the magnetic field intensity (B), solar wind temperature (T), velocity (V), pressure (P), proton density (N), and electric field intensity (Ey). The geomagnetic indices consist of the Kp, Dst, ap, and AE indices. The electron fluxes are composed of the fluxes of >2 MeV, >0.8 MeV, 475 keV, 275 keV, 150 keV, 75 keV, and 40 keV electrons. The relativistic electron flux refers to the flux of >2 MeV electrons.

Data Sources
The solar wind parameters are obtained from OMNIWeb, and the geomagnetic indices are obtained from the World Data Center for Geomagnetism, Kyoto, both at 1-hr resolution. The integral electron flux data of seven energy channels (>2 MeV, >0.8 MeV, 475 keV, 275 keV, 150 keV, 75 keV, and 40 keV) are obtained from GOES 13. The time interval is from 7 January 2011 to 14 December 2017 since the fluxes of the medium-energy electrons are available only during this period. The high-energy electron fluxes (>2 and >0.8 MeV) are 1-min resolution data, while the medium-energy electron fluxes (475, 275, 150, 75, and 40 keV) are 5-min resolution data.

Data Preprocessing
For all of these 17 variables, the raw data are adjusted to 1-day resolution. For the solar wind parameters and geomagnetic indices, the daily data are calculated as the mean values of the raw data for each day. If no hourly data are available in a day, that day is marked as a missing point. For the high-energy electron flux, the raw data are first adjusted to 1-hr resolution and then averaged to daily data. After the daily data are constructed, we supplement some missing data by linear interpolation (around 5.8% of all the >2 MeV data) and then calculate the log10 values. The daily medium-energy electron flux data are obtained in the same way, except that the raw data are at 5-min resolution. Through this process, 2,423 data points are available eventually.
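As a minimal sketch of this daily aggregation step (using pandas; the function name and the toy series below are illustrative, not the original pipeline's code):

```python
import numpy as np
import pandas as pd

def to_daily_log_flux(flux: pd.Series) -> pd.Series:
    """Average sub-daily flux to 1-day resolution, fill missing days by
    linear interpolation, then take log10 (a sketch of the steps above)."""
    daily = flux.resample("1D").mean()          # daily mean of the raw data
    daily = daily.interpolate(method="linear")  # supplement missing days
    return np.log10(daily)

# toy example: three days of hourly flux, with the middle day entirely missing
idx = pd.date_range("2011-01-07", periods=72, freq="h")
values = np.r_[np.full(24, 10.0), np.full(24, np.nan), np.full(24, 1000.0)]
flux = pd.Series(values, index=idx)
daily_log = to_daily_log_flux(flux)  # day 2 is interpolated to (10 + 1000) / 2 = 505
```

Interpolating the daily means (rather than the raw sub-daily series) matches the order of operations described above: aggregation first, gap filling second, log transform last.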

Correlation Analysis
Spearman correlation analysis is used to select the inputs of the neural network. The Spearman correlation coefficients based on the data from 2011-2017 are shown in Figure 1. They show that the correlations between the relativistic electron flux and these input variables over the preceding 5 days are quite high. Therefore, the data of the last five consecutive days are selected to predict relativistic electron flux 1 day in advance.
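As an illustration of this selection step (with synthetic data standing in for the real series; `scipy.stats.spearmanr` computes the rank correlation):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
speed = rng.random(365)                      # synthetic daily solar wind speed
flux = speed ** 3 + 0.01 * rng.random(365)   # synthetic flux, monotonically related

# rank correlation between a candidate input and the target
rho, p_value = spearmanr(speed, flux)
```

Because Spearman correlation operates on ranks, it captures monotonic but nonlinear relationships (such as the cubic one above) that a Pearson coefficient would understate, which makes it a reasonable screening tool for input selection.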

Data Sets
Three data sets, a training set, a validation set, and a test set, are constructed, with the data of approximately 1 year as the test set and the remaining 6-year data as the training and validation sets. Moreover, we adopt the 10-fold cross-validation method: the 6-year data are divided into 10 chunks, with nine chunks used to train the model and the remaining chunk used to validate the result. Running this process 10 times yields 10 submodels, each corresponding to a different validation set. The final prediction on the test set is the mean of the predictions of these 10 submodels.
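A minimal sketch of this ensembling scheme (using scikit-learn's `KFold`; `make_model` is a hypothetical factory, and the linear model below only stands in for the neural network):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

def cv_ensemble_predict(X, y, X_test, make_model, n_splits=10):
    """Train one submodel per fold on the nine remaining chunks and return
    the mean of the 10 submodels' predictions on the test set."""
    preds = []
    for train_idx, _val_idx in KFold(n_splits=n_splits).split(X):
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)

# toy data: y = 2x, so the ensembled prediction at x = 5 should be close to 10
X = np.arange(100, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()
pred = cv_ensemble_predict(X, y, np.array([[5.0]]), LinearRegression)
```

Averaging the 10 submodels' outputs reduces the variance introduced by any single train/validation split, which is the point of the procedure described above.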

The FFN
The FFN is a widely used type of neural network, in which information moves only in one direction, forward, from the input nodes through the hidden nodes and then to the output nodes. Usually, the FFN is composed of fully connected layers, which we call the fully connected FFN (FCN). The FCN consists of an input layer, one or more hidden layers, and an output layer, as shown in Figure 2a. In the FCN, each node in one layer connects with a certain weight to every node in the following layer, which means it is "fully connected." Fundamentally, the FCN approximates a continuous function through a composite function. An FCN with one hidden layer can approximate any continuous function with arbitrary accuracy as long as it has enough nodes (Hornik et al., 1989). The mathematical form of the neural network can be presented as

$$x_j^{k+1} = f_{\mathrm{activation}}\left(\sum_i w_{ij}^{k} x_i^{k} + b_j^{k}\right),$$

where $x_j^{k+1}$ is the output of the jth node of layer k + 1, $x_i^{k}$ is the output of the ith node of layer k, $w_{ij}^{k}$ and $b_j^{k}$ are the corresponding weights and biases, which are training parameters, and $f_{\mathrm{activation}}$ is the activation function.
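The layer equation above can be sketched in a few lines of NumPy (the layer sizes and the sigmoid activation here are illustrative, not the tuned architecture):

```python
import numpy as np

def dense_layer(x, W, b, f_activation):
    """x_{k+1} = f_activation(W^T x_k + b), vectorized over a batch."""
    return f_activation(x @ W + b)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 85))                   # batch of 4 samples, 85 inputs
W1, b1 = 0.1 * rng.standard_normal((85, 32)), np.zeros(32)
W2, b2 = 0.1 * rng.standard_normal((32, 1)), np.zeros(1)

h = dense_layer(x, W1, b1, sigmoid)                # hidden layer, shape (4, 32)
out = dense_layer(h, W2, b2, lambda z: z)          # linear output node, shape (4, 1)
```

Each call to `dense_layer` is one instance of the composite function described above; stacking the calls gives the forward pass of the FCN.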
The outputs of the nodes in the output layer can be seen as composite functions with many parameters. Correspondingly, training the network means finding the best parameters to make the outputs approach the true values. The method used to adjust the parameters is "backward propagation," based on gradient descent (Rumelhart et al., 1986). By back-propagating the "error" from the output layer through the hidden layers to the input layer, the parameters of the network are adjusted along the negative gradient direction. The error is the value of the loss function, which evaluates the difference between the outputs and the true values. Repeating this process over many epochs reduces the error and can yield a good result.

The Optimization of the Neural Network and Our FFN Model
We adopt a FFN in this study, which is a two-hidden-layer FCN with an external head, the scaling transformation layer, as shown in Figure 2b. Several training techniques are applied to optimize the neural network: the scaled exponential linear unit (SELU) and sigmoid activation functions, L1 and L2 regularization, a grid search to find the best hyperparameters, and a scaling transformation layer with L1 regularization added on the FCN.
Activation functions, applied to the hidden layers, enable the network to fit nonlinear functions. In this study, we adopt two activation functions, the sigmoid function and the SELU function, defined as follows:

$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$

$$\mathrm{selu}(x) = \lambda \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases}$$

where $\lambda \approx 1.0507$ and $\alpha \approx 1.6733$ (Klambauer et al., 2017).

Space Weather, 10.1029/2020SW002445
The sigmoid function is widely used as an activation function in neural networks; it maps the input into the interval (0, 1). However, in multiple-layer neural networks, the sigmoid function may cause the problem of vanishing gradients, which impedes the training of the neural network and degrades the result. The rectified linear unit (RELU) activation function, defined as relu(x) = max(0, x), was proposed to solve this problem in deep neural networks (Glorot et al., 2011). More recently, Klambauer et al. (2017) proposed the SELU activation function, which is similar to RELU but has been reported to obtain better results.
Regularization is an important method to prevent overfitting of neural networks. The L1 and L2 regularization methods are widely used in neural networks. Both L1 and L2 regularization restrain the training parameters (weights) by adding a penalty term to the loss function. The penalty term of the L1 regularization is the sum of the absolute values of the weights, while the penalty term of the L2 regularization is the sum of the squares of the weights. A critical feature of the L1 regularization is that it makes parameters more likely to become zero, making it a good method for feature selection (Tibshirani, 1996). In this study, the L1 regularization is applied to the scaling transformation layer and the L2 regularization to the other hidden layers, and the final loss function is

$$L = \frac{1}{N}\sum_{i=1}^{N}\bigl(f(x_i) - y_i\bigr)^2 + k_1\sum_{j=1}^{N_1}\lvert w_j\rvert + k_2\sum_{j=1}^{N_2} w_j^2, \tag{4}$$

where N is the size of the data set, N₁ is the number of weights with L1 regularization, N₂ is the number of weights with L2 regularization, f(xᵢ) is the prediction of the neural network on the input xᵢ, yᵢ is the corresponding true value, w denotes a weight, and k₁ and k₂ are hyperparameters. The first term on the right of Equation 4 measures the difference between the model outputs and the true values; it is calculated as the mean square error, which estimates the expectation of the flux from the probabilistic perspective, as we explain in detail in section 3.4. The second term is the L1 regularization, and the third term is the L2 regularization.
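This loss translates directly into NumPy (the hyperparameter values k1 and k2 below are placeholders, not the tuned values from the grid search):

```python
import numpy as np

def total_loss(pred, y, w_scaling, w_hidden, k1=1e-3, k2=1e-3):
    """Mean square error + L1 penalty on the scaling-layer weights
    + L2 penalty on the remaining hidden-layer weights (cf. Equation 4)."""
    mse = np.mean((pred - y) ** 2)
    l1 = k1 * np.sum(np.abs(w_scaling))
    l2 = k2 * np.sum(w_hidden ** 2)
    return mse + l1 + l2

loss = total_loss(
    pred=np.array([1.0, 2.0]), y=np.array([1.0, 2.0]),   # perfect fit: MSE = 0
    w_scaling=np.array([1.0, -1.0]),                     # L1 term: k1 * 2
    w_hidden=np.array([2.0]),                            # L2 term: k2 * 4
    k1=0.1, k2=0.1,
)
```

With a perfect fit, only the two penalty terms contribute, which makes the roles of k1 and k2 easy to inspect in isolation.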
Among the techniques applied to the neural network, a scaling transformation layer with L1 regularization is an efficient technique we propose to improve the prediction results. The proposed scaling transformation layer is a "one-one" connected layer, which makes a scaling transformation of the input (wx + b), as shown in Figure 2b. The coefficients of the scaling transformation layer are training parameters of the neural network rather than values set manually. The reason we add a scaling transformation layer with L1 regularization to the FCN is that it provides a built-in way of performing feature selection. The explanatory variables selected in this study may contain many unimportant features, and the L1 regularization makes the scaling coefficients (w) more likely to become zero, thus largely eliminating the influence of the unimportant features.
The final structure of our FFN model is shown in Figure 3.
All inputs are normalized by subtracting their respective mean values and dividing by their respective standard deviations. The optimizer selected for the neural network is the Adam optimizer, which uses historical information about the derivatives. To demonstrate the benefit of the scaling transformation layer and the SELU activation function, the experimental results of neural networks with different structures are listed in Table 1, which may be easier to interpret after section 3.3.
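The normalization step can be sketched as follows (reusing the training-set statistics for the test set is our assumption of standard practice; the text above does not state this detail):

```python
import numpy as np

def standardize(X_train, X_test):
    """Normalize each input variable by its mean and standard deviation;
    the test set reuses the training-set statistics (our assumption)."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    return (X_train - mu) / sigma, (X_test - mu) / sigma

X_train = np.array([[0.0, 10.0], [2.0, 30.0]])
X_test = np.array([[1.0, 20.0]])
Z_train, Z_test = standardize(X_train, X_test)
```

Per-variable standardization puts inputs with very different physical scales (e.g., keV fluxes versus geomagnetic indices) on a comparable footing, which helps gradient-based training.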

Performance Metrics
Generally, the metrics to evaluate a model's performance include the root mean squared error (RMSE), the PE, and the linear correlation coefficient (LC) between observations and predictions. The RMSE, PE, and LC are defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - f_i\right)^2} \tag{5}$$

$$\mathrm{PE} = 1 - \frac{\sum_{i=1}^{N}\left(y_i - f_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2} \tag{6}$$

$$\mathrm{LC} = \frac{\mathrm{Cov}(y, f)}{\sqrt{\mathrm{Var}(y)\,\mathrm{Var}(f)}} \tag{7}$$

where N is the size of the data set, $y_i$ is the observation, $\bar{y}$ is the mean value of y, $f_i$ is the prediction, Var(y) is the variance of y, and Cov(y, f) is the covariance of y and f. The smaller the RMSE is and the closer the PE and LC are to 1, the better the performance is.
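These three metrics translate directly into NumPy:

```python
import numpy as np

def rmse(y, f):
    return np.sqrt(np.mean((y - f) ** 2))

def prediction_efficiency(y, f):
    return 1.0 - np.sum((y - f) ** 2) / np.sum((y - np.mean(y)) ** 2)

def linear_correlation(y, f):
    # ddof=1 matches np.cov's default normalization
    return np.cov(y, f)[0, 1] / np.sqrt(np.var(y, ddof=1) * np.var(f, ddof=1))

y = np.array([1.0, 2.0, 3.0, 4.0])
f = y + 1.0   # a constant offset: perfectly correlated but biased
```

The offset example shows why the three metrics are reported together: a constant bias leaves LC at 1 while degrading both RMSE and PE.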

Quantile Regression Method
From the probabilistic perspective, the relativistic electron flux of a single day is not a single value but a random variable with a probability density function. The expectation, median, and quantiles of the flux can be obtained from this perspective. The uncertainty of the prediction can be presented by the probability density function. We introduce the quantile regression method here to predict quantiles and the approximate shape of the probability density function of the flux.
Quantile regression is a method that can predict not only a single value but also many quantiles of the response variable (Hallock & Koenker, 2001; Koenker & Bassett, 1978). Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities. The τth quantile is the value below which the flux falls with probability τ.
A simple explanation of the quantile regression theory is provided as follows. Let Y be a real-valued random variable with cumulative distribution function $F_Y(y) = P(Y \le y)$. Define the loss function

$$\rho_\tau(u) = u\,\bigl(\tau - I(u < 0)\bigr), \tag{8}$$

where I(·) is the indicator function. The τth quantile can be found by minimizing the expected loss of Y − f with respect to f (f can be seen as the output of the neural network):

$$\min_f E\bigl[\rho_\tau(Y - f)\bigr] = \min_f \left\{ (\tau - 1)\int_{-\infty}^{f} (y - f)\, dF_Y(y) + \tau \int_{f}^{+\infty} (y - f)\, dF_Y(y) \right\}. \tag{9}$$

By setting the derivative of the expected loss function (9) to 0, we get the solution:

$$0 = (1 - \tau)\int_{-\infty}^{q_\tau} dF_Y(y) - \tau \int_{q_\tau}^{+\infty} dF_Y(y) = F_Y(q_\tau) - \tau, \tag{10}$$

where $q_\tau$ denotes the solution. The above equation reduces to $F_Y(q_\tau) - \tau = 0$ and then $F_Y(q_\tau) = \tau$. Hence, $q_\tau$ is the τth quantile of Y.
The above analysis indicates that the τth quantile can be obtained by minimizing the expected loss function (8). This is exactly consistent with the training process of neural networks, which are trained to make the loss function as small as possible over many training epochs. As a result, it is natural to combine the neural network with the quantile regression method. By setting the loss function of the neural network to the quantile regression loss (8), the τth quantile can be predicted by the neural network. We refer to this model as the single-quantile prediction model.
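A small numerical check of this argument: minimizing the empirical pinball loss over a constant predictor recovers the sample quantile (synthetic data; a grid search stands in for the network's gradient descent):

```python
import numpy as np

def pinball_loss(y, f, tau):
    """Quantile regression loss: mean of (y - f) * (tau - I(y - f < 0))."""
    u = y - f
    return np.mean(u * (tau - (u < 0)))

rng = np.random.default_rng(0)
y = rng.standard_normal(20000)          # synthetic "flux" sample
tau = 0.9

grid = np.linspace(-3.0, 3.0, 601)      # candidate constant predictors
losses = [pinball_loss(y, f, tau) for f in grid]
q_hat = grid[int(np.argmin(losses))]    # minimizer approximates the 0.9 quantile
```

For a standard normal sample, the minimizer lands near the true 0.9 quantile (about 1.28), confirming that a network trained on this loss outputs the corresponding quantile rather than the mean.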
Moreover, we point out that, from the probabilistic perspective, the expectation of the flux can be obtained by minimizing the mean square error loss function (the first term of Equation 4), which is explained as follows:

$$\min_f E\bigl[(Y - f)^2\bigr] = \min_f \int_{-\infty}^{+\infty} (y - f)^2\, dF_Y(y). \tag{11}$$

By setting the derivative of the expected loss function (11) to 0, we get the solution:

$$\int_{-\infty}^{+\infty} (y - f)\, dF_Y(y) = 0. \tag{12}$$

Let $f_0$ be the solution. Equation 12 reduces to $f_0 = \int_{-\infty}^{+\infty} y\, dF_Y(y)$, which means $f_0$ is the expectation of Y.

Multiple-Quantiles Prediction Method
In addition to predicting a specific quantile of relativistic electron flux, the combination of the neural network and the quantile regression can also be used to predict multiple quantiles at the same time, which can further predict flux ranges and the corresponding probabilities. Moreover, the approximate shape of the probability density function of the flux can be predicted.
In this study, we design two more complex combined neural networks to predict multiple quantiles at the same time. The first combined neural network is designed with three outputs, which are for predicting the (0.5 − α)th, the 0.5th, and the (0.5+α)th quantiles, as shown in Figure 4. It consists of a scaling transformation layer, a fully connected hidden layer, and a sparsely connected hidden layer. From another perspective, the neural network is a combination of three neural networks for predicting three specific quantiles. In this way, the three neural networks share the same scaling transformation layer and the first hidden layer but keep their own second hidden layers.
For the single-quantile prediction model introduced in section 3.4, we can predict different quantiles by setting τ to different values in the loss function (8). However, the predicted quantiles may not come out in the right order. To solve this problem, we add a penalty term to the loss function of the multiple-quantiles prediction model, as shown in Equations 13-17.
The loss function of the combined neural network is designed as

$$L = \frac{1}{N}\sum_{i=1}^{N}\left[\rho_{0.5-\alpha}\bigl(y_i - \hat{y}_{0.5-\alpha,i}\bigr) + \rho_{0.5}\bigl(y_i - \hat{y}_{0.5,i}\bigr) + \rho_{0.5+\alpha}\bigl(y_i - \hat{y}_{0.5+\alpha,i}\bigr)\right] + L_{\mathrm{penalty}}. \tag{13}$$

The first term on the right side of Equation 13 comprises three loss functions used to predict the three quantiles, while the second term, $L_{\mathrm{penalty}}$, is a penalty term designed to keep the three quantiles in the right order. The 0.5th quantile is the median of the relativistic electron flux. The (0.5 − α)th and (0.5 + α)th quantiles determine a flux range $(\hat{y}_{0.5-\alpha},\ \hat{y}_{0.5+\alpha})$ with a probability of 2α.
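A sketch of such a combined loss follows; the hinge form and the weight of the penalty term are our assumption of one standard way to discourage quantile crossing, not necessarily the exact form of Equations 13-17:

```python
import numpy as np

def pinball_loss(y, f, tau):
    u = y - f
    return np.mean(u * (tau - (u < 0)))

def three_quantile_loss(y, f_lo, f_mid, f_hi, alpha=0.25, k=10.0):
    """Sum of three pinball losses plus a hinge penalty that is nonzero
    only when the predicted quantiles are out of order (penalty form assumed)."""
    loss = (pinball_loss(y, f_lo, 0.5 - alpha)
            + pinball_loss(y, f_mid, 0.5)
            + pinball_loss(y, f_hi, 0.5 + alpha))
    crossing = np.mean(np.maximum(0.0, f_lo - f_mid)
                       + np.maximum(0.0, f_mid - f_hi))
    return loss + k * crossing

y = np.zeros(4)
lo, mid, hi = -np.ones(4), np.zeros(4), np.ones(4)
ordered = three_quantile_loss(y, lo, mid, hi)    # quantiles in order: no penalty
crossed = three_quantile_loss(y, hi, mid, lo)    # lo/hi swapped: heavily penalized
```

Because the penalty vanishes whenever the three outputs are correctly ordered, it only steers training away from crossed quantiles without biasing well-ordered predictions.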
The other combined neural network is similar to the three-quantiles prediction network but has nine outputs, which predict the 0.1th, 0.2th, 0.3th, …, 0.9th quantiles, as shown in Figure 4b. Penalty terms similar to that in Equation 13 are added to the combined loss function to avoid order inconsistency among the different quantiles. From the nine quantiles, the approximate shape of the probability density function of the flux can be obtained.

Result of the Single-Value Prediction by the FFN With a Scaling Transformation Layer
The relativistic electron fluxes (>2 MeV) are predicted 1 day in advance with a total of 85 inputs, consisting of five consecutive days of daily averaged data for the 17 input variables. To evaluate the model thoroughly, the time interval is divided into seven continuous periods, each corresponding to approximately 1 year.
For each year, the other 6-year data are used to train the model and predict the relativistic electron flux of that year subsequently. The performance of the model is evaluated for each year as well as the whole period.
The results of the predictions in seven years are shown in Figure 5. Furthermore, we compare our model with several other models, as shown in Table 2. We list the performance results of the persistence model and the multiple linear regression model (linear model) in the same period as the baseline models. The results of the empirical mode decomposition model (Qian et al., 2019), the nonlinear autoregressive moving average with exogenous inputs (Boynton et al., 2015), the multivariate autoregressive model plus Kalman filter (Sakaguchi et al., 2013), and the support vector machine (Wang & Shi, 2012) are also listed in Table 2. It shows that our FFN model performs well.

Result of the Single-Quantile Prediction
We combine the quantile regression method with the FFN by adopting the loss function of the quantile regression to predict the quantiles of relativistic electron flux. Each quantile is predicted by a specific neural network. The Year 2011 is selected to present the result, as shown in Figure 6, and two periods of 2011 are selected to show the prediction result in detail. The predicted values of high quantiles are usually higher than the observations, while the predicted values of low quantiles are usually lower than the observations, which is consistent with the definition of the quantiles. For the τth quantile, the predicted frequency that the flux is less than the quantile is τ. The number of "correct predictions," meaning cases where the observed flux is less than the predicted quantile, and the corresponding observed frequency, which is the ratio between the "correct predictions" and the total data, are shown in parentheses in Figure 6a. They show that the single-quantile prediction model performs well in 2011.
Moreover, in order to thoroughly evaluate the reliability of the quantile regression method, the reliability diagram is plotted in Figure 7.
The performance of the model over the seven years is assessed through the consistency between the predicted frequencies and the corresponding observed frequencies that the flux is less than the quantile.

Result of the Multiple-Quantiles Prediction
The reliability diagrams of the combined neural network for predicting three quantiles are shown in Figure 8, where the Years 2011 and 2014 are selected to show the results. The red dashed line with red triangles is the result of the single-quantile prediction. The results of the three outputs of a specific neural network are shown by the three dots of the same color, and different colored dots correspond to different neural networks. The first float number in the parentheses is the mean absolute error between predicted frequencies and observed frequencies over the (0.5 − α)th, 0.5th, and (0.5 + α)th quantiles. The second float number in the parentheses is the frequency that the observed flux falls in the interval $(\hat{y}_{0.5-\alpha},\ \hat{y}_{0.5+\alpha})$. There are no great changes in the model performance compared with the single-quantile prediction model. More specifically, if the interval between the high quantile $\hat{y}_{0.5+\alpha}$ and the low quantile $\hat{y}_{0.5-\alpha}$ is large, the model performs well. However, if the interval is very small, such as α = 0.05 in this case, the prediction performance may fall considerably, as Figure 8a shows. We think the reason is as follows: if the interval between two quantiles were small and there were no penalty term, the two quantiles would more probably be in the wrong order. Hence, the penalty term in the loss function plays a more important role in ensuring the order consistency of quantiles with small intervals, and this influences the prediction performance of the combined neural network.
The reliability diagram of the nine-quantiles prediction model for predicting the 0.1th, 0.2th, …, 0.9th quantiles is shown in Figure 9a. It shows characteristics similar to those of the reliability diagram of the single-quantile prediction (Figure 7), with a mean absolute frequency error of 0.057. From the nine quantiles, the approximate shape of the probability density function of the flux can be obtained. The first 5 days of the Year 2011 are selected to show the predicted probability density functions of the flux; most of them show unimodal structures. Moreover, we combine all the predicted probability density functions of the flux in a year to obtain the accumulated flux data distribution and compare it with the observed flux data distribution. In this way, we use the nine predicted quantiles to represent the probability density function of a single day, which can be seen as samplings of the probability density function. Then we combine all the predicted quantiles and plot the normalized histogram to show the accumulated flux data distribution. The Years 2011 and 2014 are selected to show the results, as shown in Figures 9c and 9d. The accumulated flux data distribution is very close to the observed flux data distribution in the Year 2011, while it shows a deviation from the observed flux data distribution in the Year 2014. In conclusion, the combined neural networks for predicting multiple quantiles perform well.

Conclusion
The neural networks are powerful data-based models that can capture and represent complex input/output correlations. In this study, a FFN with a scaling transformation layer is thus designed to predict relativistic electron flux at GEO 1 day in advance during the period 2011-2017. The explanatory variables consist of six solar wind parameters (T, N, V, P, B, and Ey), four geomagnetic indices (Kp, Dst, ap, and AE), and electron fluxes of seven energy channels (>2 MeV, >0.8 MeV, 475 keV, 275 keV, 150 keV, 75 keV, and 40 keV).

From the probabilistic perspective, the relativistic electron flux is not a single value but a random variable. The single-value prediction, which adopts the mean square error loss function, predicts the expectation of the flux. Correspondingly, the quantiles of relativistic electron flux can be obtained by combining the FFN with the quantile regression method. By adopting a special loss function, we combine the FFN and the quantile regression method to predict the quantiles of relativistic electron flux, and it performs well. The combination of the neural network and the quantile regression is a powerful method: by designing multiple-quantiles prediction models that predict several quantiles at the same time, flux ranges of relativistic electrons can be predicted together with the corresponding probabilities, which is an advantage over the single-value prediction. A nine-quantiles prediction model is designed to predict nine quantiles simultaneously, which can further predict the approximate shape of the probability density function of the flux.