Confidential manuscript submitted to Space Weather

Forecasting Megaelectron-Volt Electrons inside Earth's Outer Radiation Belt: PreMevE 2.0 Based on Supervised Machine Learning Algorithms

Here we present recent progress in upgrading a predictive model for megaelectron-volt (MeV) electrons inside the Earth's outer Van Allen belt. This updated model, called PreMevE 2.0, is demonstrated to make much improved forecasts, particularly at outer L-shells, by adding upstream solar wind speeds to the model's input parameter list. Furthermore, based on several kinds of linear and artificial machine learning algorithms, a list of models was constructed, trained, validated, and tested with 42 months of MeV electron observations from the Van Allen Probes. Out-of-sample test results from these models show that, with optimized model hyperparameters and input parameter combinations, the top performer from each category of models has a similar capability of making reliable 1-day (2-day) forecasts, with L-shell-averaged performance efficiency values of ~0.87 (~0.82). Interestingly, the linear regression model is often the most successful one compared to the other models, which indicates that the relationship between 1-MeV electron dynamics and precipitating electrons is dominated by linear components. It is also shown that PreMevE 2.0 can reasonably predict the onsets of MeV electron events in 2-day forecasts. This improved PreMevE model is driven by observations from longstanding space infrastructure (a NOAA satellite in low Earth orbit, the solar wind monitor at the L1 point, and one LANL satellite in geosynchronous orbit) to make high-fidelity forecasts for MeV electrons, and thus can be an invaluable space weather forecasting tool for the future.


Introduction
Man-made satellites operating in medium- and high-altitude Earth orbits are continuously exposed to hazardous space radiation originating from different sources. Among them, one major contributor is the relativistic electron population, with energies comparable to and/or larger than their rest energy of 0.511 megaelectron-volt (MeV), trapped inside Earth's outer Van Allen belt. Owing to their high penetration capability, these MeV electrons are difficult to stop fully with normal shielding. In particular, during MeV electron events, when electron intensities across the outer belt are greatly enhanced to sustained high levels, space-borne electronic systems with inadequate hardening are susceptible to the deep-dielectric charging and discharging phenomena caused by those electrons (Lai et al., 2018) and thus may suffer severe damage or even stop functioning. Therefore, protecting critical space infrastructure from harsh space weather conditions, including MeV electron events, has high priority for stakeholders such as the space industry, service providers, and government agencies.
Similar to terrestrial weather services, real-time monitoring and model forecasting are the two principal ways of mitigating risks from outer-belt MeV electrons. Given that the successful NASA Van Allen Probes mission, also known as RBSP (Mauk et al., 2013), ended in October 2019, the need for reliable forecasting models for MeV electrons becomes compelling once again due to the coming absence of in situ measurements. Indeed, forecasting models have been developed, including the SPACECAST framework (Horne et al., 2013) for the whole outer radiation belt and the Relativistic Electron Forecast Model (based on Baker et al., 1990) currently operated by NOAA specifically for electrons at geosynchronous (GEO) orbit. Recently, Chen et al. (2019) developed and verified a new predictive MeV electron model called PreMevE to forecast MeV electron events throughout the whole outer radiation belt, using simple linear filters with inputs mainly from low Earth orbit (LEO) observations. In its preliminary form, the PreMevE model in Chen et al. (2019) was composed of two submodels, one with the main goal of predicting the onsets of MeV electron events and the other of forecasting electron flux levels. This model takes advantage of the cross-energy, cross-L-shell, and cross-pitch-angle coherence associated with wave-electron resonant interactions, particularly the fact that MeV electron enhancements lag behind low-energy (~100s of keV) electron precipitation by hours (Chen et al., 2014); ingests satellite observations from the belt boundaries, in low-altitude and geosynchronous orbits; and makes high-fidelity nowcasts and forecasts of MeV electron fluxes over L-shells between 2.8 and 7. Using simple filters trained with ~60 days of data, PreMevE has shown great potential in forecasting MeV electron distributions over years with no need for in situ MeV electron measurements (except at GEO). Details of the model can be found in Chen et al. (2019).
In this work, we further improve the PreMevE model by applying and testing several supervised machine learning (ML) algorithms with an optimized selection of input parameters.
Meanwhile, the application of ML has also gained momentum in the space weather community. An early use of artificial neural networks (NNs) to predict the flux of energetic electrons at GEO orbit was presented by Stringer et al. (1996), in which GOES-7 data were used to make 1-hr nowcasts of hourly averaged fluxes of electrons at energies of 3-5 MeV. Later, Ukhorskiy et al. (2004) and Kitamura et al. (2011) used artificial NNs to develop 1-day forecasts of daily averaged electron fluxes at GEO. More recently, Shin et al. (2016) used an NN scheme with solar wind inputs to predict GEO electrons over a wide energy range with different time resolutions. Wei et al. (2018) also successfully improved the 1-day forecasts of >2-MeV electron fluxes at GEO by applying deep learning algorithms. For a review, Camporeale (2019) summarized the recent progress and opportunities in applying ML to space weather forecasting problems, including predicting geomagnetic indices, relativistic electrons, solar flare occurrence, coronal mass ejection propagation time, solar wind speed, and other applications.
The purpose of this work is to present how PreMevE has been upgraded with ML algorithms to make improved predictions of MeV electron flux distributions. With no requirement of in situ MeV electron measurements except at GEO, this unique model shows great potential for meeting the predictive requirements for outer-belt electrons during the post-RBSP era. Section 2 briefly describes the data and parameters used for this study, and the selected ML algorithms and their implementations are explained in section 3. Section 4 compares and summarizes the prediction performance of different models, followed by detailed discussions in section 5. This work is concluded in section 6 with a summary of our findings and possible future directions.

Data and Input Parameters
Electron data used in this work include observations made by particle instruments aboard an RBSP spacecraft, one Los Alamos National Laboratory (LANL) GEO satellite, and one NOAA Polar Operational Environmental Satellite (POES) over a time period ranging from February 2013 to August 2016, as shown in Figure 1. Electron data used here are the same as in Chen et al. (2019), in which detailed descriptions of the original data and their preparation can be found; here is a brief recap. First, trapped 1-MeV electrons across a range of L-shells (L ≤ 6) are measured in situ by the Magnetic Electron Ion Spectrometer (MagEIS) instrument (Blake et al., 2013) on board RBSP-a, and the spin-averaged fluxes are plotted in Panel A as a function of L-shell and time. Here we use McIlwain's L values (McIlwain, 1966) calculated from the quiet Olson and Pfitzer magnetic field model (Olson & Pfitzer, 1977) together with the International Geomagnetic Reference Field model. Data from POES satellites have been previously used to both model and forecast electrons with energies ranging from 100 keV up to 2 MeV measured by RBSP (e.g., Allison et al., 2018; Chen et al., 2016; Chen et al., 2019). At GEO, we use observations from the Synchronous Orbit Particle Analyzer (SOPA; Belian et al., 1992) instrument carried by the GEO satellite LANL-01A. For simplicity, all GEO fluxes are put at the fixed L = 6.6 and plotted at the top of Panel A. Then, precipitating electrons are monitored by the Space Environment Monitor 2 (SEM2) instruments on board NOAA POES satellites in LEO (Evans et al., 2000), and the count rates from the 90° telescopes on NOAA-15 are presented for three energy channels in Panels B, C, and D. Here L values for NOAA-15 are calculated from the International Geomagnetic Reference Field model. Additionally, upstream solar wind (SW) speeds in Panel E are downloaded from the CDAWeb site and added to the models' inputs, based on their high geo-effectiveness previously identified (e.g., see Wing et al., 2018, and references therein).

Figure 1. Overview of electron observations and solar wind speeds used in this study. All panels cover the same 1289-day interval starting from 2013/02/20. Panel (a) shows flux distributions of 1-MeV electrons, the variable to be forecasted (i.e., the targets). Similarly, Panels (b) to (d) show count rates of precipitating electrons measured by NOAA-15 in a low Earth orbit, for the E2, E3, and P6 channels, respectively. Panel (e) plots the solar wind speeds measured upstream of the magnetosphere as in the OMNI data set for the period. Data in Panels (b) to (e) are the model inputs (i.e., the predictors).

All RBSP-a, LANL-01A, and POES-15 electron fluxes as well as solar wind speeds in Figure 1 are binned by 5 hr to allow for RBSP's full coverage of the outer belt in each time bin. The L-shell bin size for electrons is 0.1.
Throughout this work, we refer to POES electron fluxes at >100 keV, >300 keV, and >1000 keV as E2, E3, and P6, respectively. Logarithmic values of E2, E3, and P6, along with standardized values of SW speeds, form the input data sets, or predictors, used to forecast the logarithm of 1-MeV trapped electron fluxes, sometimes also referred to as the "target." The standardization of SW speeds is done by subtracting the mean and dividing by the standard deviation (both the mean of 404.8 km/s and the standard deviation of 86.8 km/s are computed from the training set as defined in section 4). Hereinafter, when we refer to the 1-MeV target, E2, E3, and P6 fluxes, we are actually referring to their logarithmic values. The L-shell coverage of this study is confined to 2.8-6 (the range of RBSP) and 6.6 (LANL GEO), while fluxes at other L-shells can be derived by radial interpolation or extrapolation (Chen et al., 2019).
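The preprocessing just described can be sketched as follows; the function and variable names are illustrative (not from the PreMevE code base), and only the quoted training-set statistics are taken from the text.

```python
import numpy as np

def preprocess(e2, e3, p6, sw, sw_mean=404.8, sw_std=86.8):
    """Log-scale the POES channels and standardize the SW speed.

    The mean (404.8 km/s) and standard deviation (86.8 km/s) are the
    training-set statistics quoted in the text; array names here are
    illustrative only.
    """
    return (np.log10(e2), np.log10(e3), np.log10(p6),
            (sw - sw_mean) / sw_std)

# Example with made-up values:
log_e2, log_e3, log_p6, sw_scaled = preprocess(
    np.array([1e3]), np.array([1e2]), np.array([10.0]),
    np.array([404.8]))
# sw_scaled is 0 at the training-set mean speed
```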

Supervised Machine Learning Algorithms
ML can be described as a collection of techniques in which systems improve their performance through automatic analysis of data. The power of ML models lies in their capacity to extract statistical information (patterns and features) from ample data without requiring a hypothesis, in sharp contrast to physics-based models in which researchers manually select the parameters to be used as model inputs according to specific governing physics. ML models are capable of extracting signatures and correspondences that might be overlooked by traditional methods, for example, nonlinear relationships, and can be relatively easy to use with multiple input sources. Therefore, under certain circumstances, ML models can outperform traditional ones. For example, Tajbakhsh et al. (2016) found that deep NN models outperformed handcrafted solutions in medical image analysis tasks. Nevertheless, one major drawback of ML models, particularly deep NNs, is their limited interpretability ("how") and explainability ("why") (Murdoch et al., 2019). Thus, ML models can sometimes be complicated to explain, hindering our ability to propose new theories based on ML results.
Common ML algorithm types include supervised, unsupervised, semisupervised, and reinforcement learning (Ayodele, 2010). Algorithms used here fall under the category of supervised learning as they make use of input sample data paired with an appropriate label. The label here refers to 1-MeV electron flux at different L-shells, the target value to be forecasted. Moreover, the models implemented here can be classified as regressions, as the labels are specified scalar values.
As explained by Camporeale (2019), supervised regressors try to find the mapping relationship between a set of multidimensional inputs x = (x_1, x_2, …, x_N) and its corresponding scalar output label y, under the general form

y = f(x) + ϵ, (1)

where f : R^N → R is a linear or nonlinear function and ϵ represents additive noise. All methods used to find the unknown function f can be seen as an optimization problem in which the objective is to minimize a given loss function. The loss function maps the distance between all the predicted and target values into a real number, therefore providing some "cost" associated with the prediction. The following four subsections provide details on each of the supervised regressor models used in this study. A comprehensive discussion of artificial NNs and deep learning models can be found in LeCun et al. (2015), with information about techniques common to several artificial intelligence applications. To exemplify the supervised learning problem as a flux forecasting task, consider predicting the 1-MeV electron fluxes at time t at the GEO shell using the past values of 1-MeV electron fluxes at GEO. Suppose we use M training samples to perform the analysis, and the number of past values we wish to use for each time step is four (N = 4). That is, we have M pairs of (x_t, y_t) training samples, or {(x_1, y_1), (x_2, y_2), …, (x_M, y_M)}, where x_t = (x_{t−1}, x_{t−2}, x_{t−3}, x_{t−4})^T ∈ R^4 and y_t ∈ R. We can rewrite the predictors x_t as a matrix X ∈ R^{N×M}, where each column of X represents one x_t training sample vector. The y_t samples can also be arranged as a single-row matrix Y ∈ R^{1×M}. The goal of ML training is to optimize the internal parameter values of the given mapping function f, a specified ML algorithm, by minimizing the loss function associated with the noise term ϵ after inserting X and Y back into Equation (1).
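As a concrete illustration of the X and Y construction above, the following sketch builds the (x_t, y_t) training pairs from a single time series; the helper name and the toy series are hypothetical.

```python
import numpy as np

def make_training_pairs(series, n_past=4):
    """Build (x_t, y_t) pairs: each column of X holds the n_past previous
    values of the series, and Y holds the value to be predicted.

    Returns X with shape (n_past, M) and Y with shape (1, M), matching
    the X in R^{N x M}, Y in R^{1 x M} layout described in the text.
    """
    M = len(series) - n_past
    X = np.empty((n_past, M))
    Y = np.empty((1, M))
    for t in range(M):
        # x_t = (x_{t-1}, ..., x_{t-n_past})^T, most recent value first
        X[:, t] = series[t:t + n_past][::-1]
        Y[0, t] = series[t + n_past]
    return X, Y

X, Y = make_training_pairs(np.arange(10.0), n_past=4)
# X.shape == (4, 6), Y.shape == (1, 6); first target is series[4] = 4.0
```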
Here we use the past values of multiple input data, including E2, E3, P6, and solar wind speed, to forecast 1-MeV electron fluxes at each individual L-shell. Next, we describe the four selected algorithms, namely, linear regression, multilayer perceptron, convolutional NN (CNN), and long short-term memory (LSTM) methods.

Linear Regression
Linear regression is perhaps the simplest supervised learning method, and it is sometimes also regarded as the simplest ML algorithm. This algorithm has a vast range of applications and constitutes a basic building block for more complex algorithms. The linear regression equation is given by

f(x) = w^T x + b, (2)

where w is a vector containing the weights and b is the bias term. In a predictive problem, y as in Equation (1) represents the label, or target, to be predicted (the 1-MeV electron flux), x represents the input data (e.g., past values of precipitating electron fluxes), and w represents the set of linear coefficients that minimize the loss, or the sum of the errors between all true values of y and the predicted f(x_i). From the optimization perspective, the weights w can be obtained using the ordinary least squares method. Linear models are simple models generally very useful as baselines, and their selection for this work is also due to the success of the previous work by Chen et al. (2019).
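A minimal sketch of fitting Equation (2) by ordinary least squares on synthetic data (all values here are invented for illustration, not mission data):

```python
import numpy as np

# Fit f(x) = w^T x + b, the linear regression of Equation (2).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples, 3 input features
true_w, true_b = np.array([1.5, -2.0, 0.5]), 3.0
y = X @ true_w + true_b                  # noise-free synthetic targets

# Append a column of ones so the bias is fit alongside the weights.
A = np.hstack([X, np.ones((100, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:-1], coef[-1]               # recovers true_w and true_b
```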

Feedforward NNs
Starting from the linear model, a single neuron can be defined as

f(x) = a(w^T x + b), (3)

where a(·) is an element-wise activation function. The activation function introduces nonlinearity into the model. Some of the most common activation functions are the Rectified Linear Unit (ReLU; Hahnloser et al., 2000; Nair & Hinton, 2010) and the Exponential Linear Unit (ELU; Clevert et al., 2015). ReLU outputs the maximum value between zero and the input; that is, ReLU outputs zero when the input is negative and maintains the input otherwise. ELU uses an exponential curve for negative values of the input and keeps positive input values unchanged. Neurons that take in a set of inputs (x) and produce an output f can be combined into layers. Using the f_i^[l] notation to represent the output of neuron i at layer l, we can rework Equation (3) for the upcoming layer as f^[l+1] = a(w^T f^[l] + b), so that the inputs of layer l+1 depend on the outputs of the previous layer l. Figure 2 illustrates a single neuron on the left and how sets of neurons can be combined to form layers and NNs on the right. Here the information flows from left (the input) to right (the output) as in Figure 2b, and this structure defines a class of feedforward NNs (FNNs). An artificial NN is a model consisting of connected neurons. The term deep model, or deep learning, is generally used for NNs containing more than one hidden layer.

Figure 2. Visual generic representation of a single neuron and an artificial neural network. Panel (a) shows a single neuron that can be split into linear and nonlinear components, as well as the input and output data. In the case of a forecasting problem, the inputs can be data representing past times t−1, t−2, t−3, t−4, and the output is the prediction at the current time t0 or even some future time. Panel (b) shows how a set of neurons constitutes a layer and how the output of a layer can be used as input for the next layer.
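The single-neuron computation of Equation (3) and the two activation functions described above can be written compactly as follows; the weight and input values are arbitrary toy numbers.

```python
import numpy as np

def relu(x):
    """max(0, x): zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    """x for x > 0, alpha*(exp(x) - 1) otherwise (Clevert et al., 2015)."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def neuron(x, w, b, a=relu):
    """Single neuron of Equation (3): f(x) = a(w^T x + b)."""
    return a(w @ x + b)

x = np.array([1.0, -2.0, 0.5, 3.0])      # e.g., four past time steps
w = np.array([0.2, 0.1, -0.3, 0.05])
out = neuron(x, w, b=0.1)                # a single scalar activation
```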

Convolutional NNs
CNNs are powerful and influential deep learning model architectures. The computer vision field strongly adopted CNNs as its workhorse after the CNN described in Krizhevsky et al. (2012) achieved new levels of accuracy in the popular ImageNet Large Scale Visual Recognition Competition (Russakovsky et al., 2015). All CNNs make use of the fundamental convolution kernel. Convolution operates on two functions, one generally interpreted as the "input" and the other as a "filter." The filter is commonly referred to as the "kernel." The kernel is applied to the input, producing an output image or signal. During the training stage, the values of the kernels are updated in such a way that the output generated by the CNN becomes more similar to the desired label, that is, minimizes the cost. Just like the neurons described in section 3.2, a set of convolutional kernels can be combined into layers. Dumoulin and Visin (2016) give details on the arithmetic of convolutions for deep learning. Here, we provide only the essential equation for 1D convolution. A 1D convolution of the input vector x and the kernel g of length m (the window width of the convolution) is given by

(x ∗ g)(i) = Σ_{j=1}^{m} x(i − j) g(j). (4)

A CNN unit in deep learning models is a composite of an activation function and the convolution term in Equation (4), i.e., f(i) = a((x ∗ g)(i)). A value of four was used for m in this study. Springenberg et al. (2014) observed that CNNs commonly use alternating convolution and max-pooling layers followed by a small number of fully connected layers, with the models typically regularized during training by using dropout. Max-pooling layers are simple down-sampling steps in which the maximum value of each patch (containing multiple values) of a feature is used to represent the entire patch, effectively reducing the feature size. Dropout layers randomly select a percentage of their inputs to be ignored during the training phase. Dropout is useful for avoiding overfitting, and it is a general approach not specific to CNN models.
Srivastava et al. (2014) showed that dropout improves the performance of NNs on many supervised learning tasks such as speech recognition, document classification, vision, and computational biology.
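A minimal sketch of the 1D convolution of Equation (4) and the max-pooling step described above, implemented as the sliding dot product used in deep learning libraries (with "valid" boundaries and non-overlapping pooling patches, both assumptions made for this toy example):

```python
import numpy as np

def conv1d(x, g):
    """Valid-mode 1D convolution: slide the length-m kernel g over x."""
    m = len(g)
    return np.array([x[i:i + m] @ g for i in range(len(x) - m + 1)])

def max_pool(x, size=2):
    """Keep the maximum of each non-overlapping patch of `size` values."""
    return np.array([x[i:i + size].max()
                     for i in range(0, len(x) - size + 1, size)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
g = np.array([0.5, 0.5])                 # a length-2 smoothing kernel
y = conv1d(x, g)                         # [1.5, 2.5, 3.5, 4.5, 5.5]
p = max_pool(y)                          # [2.5, 4.5]
```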

Long Short-term Memory
LSTM networks are a popular recurrent NN (RNN) structure introduced by Hochreiter and Schmidhuber (1997). An RNN is a class of artificial NN in which neurons can be connected to form a directed graph along a temporal sequence (Figure 3). Different from traditional feedforward NNs, an LSTM has internal loops that allow it to retain information from previous time steps and decide its usage for predictions. Indeed, the basic LSTM unit is called a memory cell, inside which internal components decide when to keep or override information in the memory cell, when to access the information in the memory cell, and when to prevent other units from being perturbed (Hochreiter & Schmidhuber, 1997). Olah (2015) provides a detailed walkthrough of the LSTM components. LSTMs are commonly used in speech recognition problems (e.g., Graves et al., 2013; Graves & Schmidhuber, 2005) as well as in forecasting (e.g., Kong et al., 2019). Here LSTM was selected for testing as a representative of RNNs. In LSTM models, the basic unit h is also called a memory cell. The input vector x at an arbitrary time t is processed by a memory cell h, which produces an output f(x). The output produced by h_{t−1} is also part of the input for h_t. Thus, events at time t are processed with information from the previous steps. The output produced by h can be used as input to the next layer, as described for the previous models.
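To make the gating description concrete, here is a single LSTM memory-cell step in the commonly implemented formulation; the parameter shapes and stacking order are illustrative choices, not taken from the PreMevE models.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step: gates decide what to write to, keep in, and read
    from the memory cell c. W, U, b stack the input, forget, output,
    and candidate parameters (an illustrative layout).
    """
    z = W @ x_t + U @ h_prev + b
    n = len(c_prev)
    i = sigmoid(z[:n])                   # input gate
    f = sigmoid(z[n:2 * n])              # forget gate
    o = sigmoid(z[2 * n:3 * n])          # output gate
    g = np.tanh(z[3 * n:])               # candidate memory content
    c = f * c_prev + i * g               # update the memory cell
    h = o * np.tanh(c)                   # expose part of it as output
    return h, c

# Tiny example: 2 inputs, 3 memory cells, random parameters.
rng = np.random.default_rng(1)
W, U, b = rng.normal(size=(12, 2)), rng.normal(size=(12, 3)), np.zeros(12)
h, c = np.zeros(3), np.zeros(3)
for x_t in np.ones((4, 2)):              # process a length-4 sequence
    h, c = lstm_cell(x_t, h, c, W, U, b)
# h now summarizes the sequence and can feed the next layer
```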

Testing Algorithms and Model Performance
Following ML best practices, we split the data into training, validation, and test sets. The training set is the data effectively used for model optimization. The validation set is used to tune model hyperparameters, such as the number of neurons/layers or optimization options, and to eliminate underperforming models. Finally, the test set is reserved for model performance evaluation at the final stage. Here, the training data set consists of observations in the first 4,008 time bins (roughly 835 days, or 27.4 months, 65% of the whole data), the validation set has observations for the next 841 time bins (roughly 175 days, or 5.8 months, 14% of the data), and the test set covers the final 1,280 time bins (roughly 267 days, or 8.8 months, 21% of the data). The observational data are split in such a manner that the major observational gap over days 840-850 falls between the sets; thus, the models are always trained, validated, and tested on segments containing continuous observations.
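A minimal sketch of this chronological split (the set sizes are those quoted above; the stand-in series is hypothetical):

```python
# Chronological split as described in the text: no shuffling, so each
# set is a contiguous block of 5-hr time bins.
n_train, n_val, n_test = 4008, 841, 1280

def split_bins(data):
    """data: sequence ordered by time bin; returns the three sets."""
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

series = list(range(n_train + n_val + n_test))   # stand-in for real bins
train, val, test = split_bins(series)
# len(train), len(val), len(test) == 4008, 841, 1280
```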
The optimization goal for all the models is to reduce the root-mean-square error (RMSE) between the real values y and the predicted values f at each individual L-shell, both of size M. RMSE is defined as

RMSE = sqrt( (1/M) Σ_{i=1}^{M} (f_i − y_i)² ).

In this study, linear models minimize the error using ordinary least squares, while artificial NN models use the Adam optimization as defined by Kingma and Ba (2014). Chen et al. (2019) demonstrated that E2 fluxes can be used for predicting the onset timings of MeV electron events, and here we also computed the normalized temporal derivative of E2 fluxes, named dE2, and tested adding it to the input data sets for predicting onsets. The dE2 at any given time bin t is defined as dE2_t = (E2_t − E2_{t−1}) / E2_{t−1} (note the models forecast the future at time t+Δt based on input data at the current and/or past time t). The temporal correlation between E2, dE2, and trapped MeV electron fluxes can be recognized from Figure 4.
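The RMSE loss and the dE2 derivative defined above can be computed as follows; the toy flux values are invented, and the function names are illustrative.

```python
import numpy as np

def rmse(y, f):
    """Root-mean-square error between targets y and predictions f."""
    return np.sqrt(np.mean((f - y) ** 2))

def de2(e2):
    """Normalized temporal derivative: dE2_t = (E2_t - E2_{t-1}) / E2_{t-1}.

    The first bin has no predecessor, so the output is one element
    shorter than the input.
    """
    e2 = np.asarray(e2, dtype=float)
    return (e2[1:] - e2[:-1]) / e2[:-1]

err = rmse(np.array([1.0, 2.0]), np.array([1.0, 4.0]))  # sqrt(2)
deriv = de2([100.0, 150.0, 75.0])                       # [0.5, -0.5]
```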

Test Input Parameter Combinations
Our first experiment tests different combinations of input parameters with the objective of finding the set of input data that best predicts 1-MeV electrons. Specifically, we use linear and LSTM models to evaluate which combination of input parameters yields the highest performance efficiency (PE). PE provides a measure for quantifying the accuracy of predictions by comparison to the variance of the target. Naming y the true value (the logarithm of the target 1-MeV electron flux) and f the predicted value, both of size M, PE is defined as

PE = 1 − Σ_{i=1}^{M} (y_i − f_i)² / Σ_{i=1}^{M} (y_i − ȳ)²,

where ȳ is the mean of y. PE does not have a lower bound, and the perfect score is 1.0, meaning all predicted values perfectly match the observed data, or that f = y.
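A direct implementation of the PE definition above (with toy values only):

```python
import numpy as np

def performance_efficiency(y, f):
    """PE = 1 - sum((y - f)^2) / sum((y - mean(y))^2).

    1.0 means a perfect forecast; 0.0 means no better than always
    predicting the mean of the observations; negative values are worse
    than that (PE has no lower bound).
    """
    y, f = np.asarray(y), np.asarray(f)
    return 1.0 - np.sum((y - f) ** 2) / np.sum((y - y.mean()) ** 2)

y = np.array([1.0, 2.0, 3.0, 4.0])
pe_perfect = performance_efficiency(y, y)                    # 1.0
pe_mean = performance_efficiency(y, np.full(4, y.mean()))    # 0.0
```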
To make 25-hr (i.e., five time bins, also called 1-day hereinafter) forecasts of MeV electrons for each individual L-shell, our models ingest the past values of the input data at the same L-shell. The only exception is at GEO, where the model inputs also include past MeV electron fluxes at GEO from in situ measurements. Additionally, since Chen et al. (2019) found that E2, E3, and P6 values at GEO have relatively weak correlations with 1-MeV electrons, the E2, E3, and P6 channels at the L-shell of 4.6 are used instead as model inputs. In this way, our models were trained independently for each individual L-shell. The term "window size" refers to how many 5-hr time bins of input data are needed by the models. Chen et al. (2019) found a window size of 15 time bins (equivalent to 75 hr) to be effective for the forecast of MeV electrons. Adhering to the "power of two" ML convention, here we used a window size of 16. The "power of two" rule is based on the fact that CPU and GPU memory architectures are usually organized in powers of two; thus, organizing data in powers of two can be beneficial for computational efficiency. As for the naming convention, when an LSTM model has one layer with 128 memory cells, we use LSTM-128 as the name for this model; the linear models are referred to as LinearReg throughout the manuscript. Here results from submodels 1 and 2 of the previous PreMevE in Chen et al. (2019) are always cited as linear1 and linear2 for a baseline comparison. Note that in this work all PE values and predicted fluxes from linear1 and linear2 are for 1-day forecasts only. Table 1 summarizes the overall PE values (averaged over all L-shells) for 20 tests performed for 1-day predictions. For each of the two categories of models, 10 input parameter sets are tested, starting from each single parameter to various combinations. Here we focus on the validation PE values, that is, those in the column "PE validation," to judge model performance.
The general trend is that more parameters lead to better performance. For example, the last LinearReg model, with all parameters as input (the 10th model), has not only the highest validation PE value of 0.830 but also the highest PE at GEO (0.587). These two values are higher than those for linear2 (0.753 and 0.352), which indicates significant improvements. Linear1 was designed to capture the onset timings of MeV electron events, and thus its PE values are always smaller than those of linear2 (Chen et al., 2019). Interestingly, the last two LSTM models (the 19th and 20th) have the highest validation and GEO PE values for this category, but still slightly lower than those of the 10th LinearReg model. Indeed, PE values for the test data in Table 1 show the same trend, and thus the same is true for PE values for the combined validation and test (i.e., out-of-sample) data.
In this step, we also confirmed that adding SW speeds to the input list improves model performance, which was not tested previously in Chen et al. (2019). In Table 1, the validation PE for the 1st model, using SW speed as the sole input parameter, is 0.289, which suggests this simple model can predict MeV electrons over the whole outer belt to some degree but not as well as linear2, although its PE of 0.557 at GEO is much higher than that of linear2 (0.352). In comparison, PE values from the 11th model show that using SW speed as the sole parameter for the LSTM model does not work as well as for the 1st LinearReg model, particularly at GEO. When comparing models without and with SW speeds, for example, the 2nd vs. 7th (12th vs. 17th) and 8th vs. 9th (18th vs. 19th), the improvements in validation PE are 0.011 (0.034) and 0.011 (0.029), respectively, while the improvements in PE at GEO are more significant, up to 0.11. We also tested dE2, and its addition to the input has effects less significant than SW speeds when comparing the PE values of the 10th (20th) model to those of the 9th (19th). Details of how model PEs improve as a function of L-shell are presented in Figure 5. For models in both categories, the top performer has much higher PE values than linear2 across the whole belt, with the most significant improvements at outer L-shells >~4.5 and a maximum increment of >0.4 at L~5.5. It can be clearly seen from the green curve in Panel A that the SW speed is a very helpful predictor at outer L-shells (L >~5), especially for LinearReg models, but ineffective at inner L-shells. This can be explained by the fact that particle dynamics in the high L-shell region are more controlled by adiabatic effects, and is also consistent with the experience from existing predictive models for electrons at GEO (e.g., Baker et al., 1990) as well as the significant difference between MeV electrons at GEO and at L-shells below 4.5 recently reported by Baker et al. (2019). However, as in Panel B, the LSTM model using SW speeds as the sole predictor has PE values greater than zero at only a few L-shells. In summary, results in both Table 1 and Figure 5 suggest that the model PE values are higher with the use of more input data from multiple precipitating electron channels as well as the SW speed. Therefore, tests in the rest of this study generally use the parameter combination including all inputs.

Model Selection and Evaluation Metrics
We then advanced to test a list of models built upon different algorithms with varying model hyperparameters (e.g., the window size and number of neurons). There are four categories of models, Linear, FNN, LSTM, and CNN, as described in section 3, and here is how these models and test runs were set up. First, to account for cross-shell information as in Chen et al. (2019), some tests include E2 data at the L-shell of 4.6 as input for all other L-shells. Second, all FNN models presented here are composed of two hidden layers: the first has 64 neurons and the second has 32 neurons, and the neurons use ELU as the activation function. In our early testing, we found that ELU achieved marginally better performance than the more widely adopted ReLU activation function. A dropout layer that randomly selects 50% of its inputs to be inactivated is included after each of the activation functions to help prevent overfitting. The output layer consists of a single neuron without an activation function. The dropout layers, used during training and deactivated during prediction, and the output layer are not counted as hidden layers but are also part of the model. We name such models FNN-64-32-elu. Then, the CNN models with a window size of 16 are composed of two convolutional layers; the first convolutional layer contains 64 kernels followed by a max-pooling layer with size and stride equal to two, and the second convolutional layer contains 32 kernels followed by a max-pooling layer with the same size and stride. The CNN models with a window size of 4 are composed of a single convolutional layer with 64 kernels followed by a max-pooling layer. The kernels are one-dimensional with a size of three (i.e., the m in Equation (4)) and use ReLU as the activation function. The convolutional layers are finalized with 50% dropout. The output layer consists of a single neuron without an activation function. These CNN models are named Conv-64-32 and Conv-64, respectively.
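For illustration, a prediction-mode forward pass consistent with the FNN-64-32-elu description above might look like the following; the parameter values are random placeholders, the 64-element input is only an assumed flattening of a 16-bin window of 4 parameters, and dropout is omitted because it is deactivated during prediction.

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1.0)

def fnn_64_32_elu_predict(x, params):
    """Prediction-mode pass of an FNN-64-32-elu-style network: two ELU
    hidden layers (64 then 32 neurons) and one linear output neuron.
    `params` holds (W1, b1, W2, b2, W3, b3); dropout layers are active
    only during training and so do not appear here.
    """
    W1, b1, W2, b2, W3, b3 = params
    h1 = elu(W1 @ x + b1)                # hidden layer 1: 64 neurons
    h2 = elu(W2 @ h1 + b2)               # hidden layer 2: 32 neurons
    return W3 @ h2 + b3                  # linear output: the forecast

# Assumed input size: 16 time bins x 4 input parameters = 64 values.
rng = np.random.default_rng(2)
n_in = 64
params = (rng.normal(size=(64, n_in)) * 0.1, np.zeros(64),
          rng.normal(size=(32, 64)) * 0.1, np.zeros(32),
          rng.normal(size=(1, 32)) * 0.1, np.zeros(1))
forecast = fnn_64_32_elu_predict(rng.normal(size=n_in), params)
# forecast.shape == (1,): one predicted flux value
```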
Finally, the LSTM models follow the same structure as those described in section 4.1. Model performance is again evaluated by PE values comparing forecasts to the target data. Table 2 presents the overall PE values for 24 test runs performed for 1-day predictions, and Table 3 presents PE values for the same test runs for 50-hr (i.e., 10 time bins, also called 2-day hereinafter) predictions. Inside each category, the effects of window size, neuron/layer numbers, and input parameters are tested and compared, and Tables 2 and 3 show only the results of models with good performance. For 1-day forecasts, as in Table 2, the sixth LinearReg model has a high overall PE of 0.872 for the out-of-sample test and 0.587 at GEO (higher than the PE of 0.303 at GEO for 1-day persistence forecasts). The top performers in the other three categories have similar overall and GEO PE values. All those values are higher than the overall PE of 0.797 and GEO PE of 0.352 from linear2 for 1-day forecasts. For 2-day forecasts in Table 3, the top performers are the same as for 1-day predictions except in the FNN category. Here, the sixth LinearReg model has the highest overall PE of 0.827 for the out-of-sample test and 0.333 at GEO (higher than the PE of −0.245 at GEO for 2-day persistence forecasts). Again, the top performers have overall PEs of ~0.82 for 2-day predictions, which is lower than the ~0.87 for 1-day predictions but still higher than the ~0.80 of linear2 for 1-day forecasts. Chen et al. (2019) showed that linear2 has lower PE for 2-day forecasts than for 1-day forecasts. The top performers' PEs at GEO are mostly above 0.33, comparable to linear2. Note that for the FNN category, the ninth model, which takes no E2 at L = 4.6 as input, is the top performer in Table 3, instead of the 11th as in Table 2. None of the CNN models in Table 3 can make 2-day forecasts at GEO very well.
Figure 6 plots PE curves for both 1- and 2-day forecasts as a function of L-shell, which further confirms that our models' performance is more robust than the previous results published in Chen et al. (2019). First, the PE curves for all four top models cluster together, with PE values at outer L-shells (minimum >~0.3) lower than those at inner L-shells (maximum >0.8 in the left panel and >0.7 in the right). All PE curves for both 1-day and 2-day forecasts lie well above that from linear2 (1-day), except at low L-shells for 2-day forecasts. The most significant improvements in PE occur for L-shells >4.5. For 1-day forecasts, thanks to the addition of E2 at L = 4.6 to the parameter list, the sixth LinearReg model in Table 2 (the thick green line in Figure 6a) outperforms the 10th model in Table 1 with a higher overall PE. In addition, the performance of LinearReg models is persistently good for both 1- and 2-day forecasts, particularly at GEO where the other top models degrade quickly, as seen in Figure 6.
It is striking how similar the forecasting abilities of the four categories of models (LinearReg, FNN, LSTM, and CNN) are when they use similar input data. Moreover, the LinearReg models often show leading performance, particularly for 2-day predictions. Two main observations help explain this behavior. The first is that a large part of the interplay between trapped 1-MeV electrons and the input parameters (precipitating electrons and SW speeds) appears to be linear, as supported by the large linear correlation coefficient values calculated in Chen et al. (2019). The previous PreMevE of Chen et al. (2019) achieved high PE using linear filters to forecast MeV electrons, and our findings corroborate those results. The second is that artificial NNs, as described in section 3, contain linear components; when a linear model achieves good results, artificial NNs are expected to do at least as well. Thus, the dominance of linear components explains why the top models from all four categories of algorithms have very similar predictive performance. In addition, the secondary role of nonlinear components gives CNN models the best overall PE of 0.877 for the combined validation and test sets in Table 2, and gives the FNN and LSTM models the best PE at GEO (Tables 2 and 3). Therefore, this new PreMevE 2.0 model indeed includes all four algorithms, which form an ensemble of predictive models whose relative weights are left for future work to determine. Next, we take a closer look at the predicted results from all four algorithms.
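To illustrate why a purely linear model can be so competitive, consider single-input ordinary least squares in closed form: when the driver-target relationship is close to linear, this fit recovers it exactly, and a network containing linear pathways can do no better on that linear part. This is an illustrative sketch with our own naming, not the multi-input windowed regression actually used:

```python
def fit_linear(x, y):
    """Ordinary least-squares fit y ~ a*x + b (single input, closed form).
    a = cov(x, y) / var(x); b places the line through the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    a = cov / var
    return a, my - a * mx
```

Fitting, e.g., a driver-target pair that obeys y = 2x + 1 exactly returns a = 2 and b = 1 with zero residual, which is the sense in which a linear model "saturates" the linear component of the dynamics.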

Detailed Predictions and Discussions
An overview of the 1-day forecasted flux distributions is exhibited in Figure 7 compared to the 1-MeV flux target. Visually, forecasted distributions from the four top performers in Table 2 closely resemble the target, although some weak events are underestimated (e.g., the one on ~Day 1080) or even totally missed (e.g., the one on ~Day 870). This is deemed acceptable at this stage since the PreMevE model mainly aims to forecast high flux levels of MeV electrons. Similarly, Figure 8 compares 2-day forecasted results to the target data and shows a similar resemblance, confirming the stable predictive performance of PreMevE 2.0 with a longer lead time.

[Figure caption: One-day forecasts compared to target fluxes at three selected L-shells over the combined validation and test period. Panels (a) to (c) are for L-shells of 3.5, 4.5, and 5.5, respectively. The measured 1-MeV electrons (black) are compared to predictions from the LinearReg, FNN, LSTM, and CNN models with the highest PE in each category (Table 1) as well as the linear2 model (yellow).]
Details of 1-day forecasts over the test data portion are shown in Figure 9. Panels in the left column present predicted flux distributions, and panels on the right highlight the errors-specifically, how far the ratio of prediction to target differs from one (a perfect match is shown in greenish color). Thus, negative values (blueish colors) in the right panels of Figure 9 indicate that the model overpredicts, whereas positive values (reddish colors) indicate that it underpredicts. For example, against the mostly greenish background, the red vertical stripes in the right panels highlight where predictions lag behind the onsets of MeV electron events, while the blue regions at L ~3.5 indicate mismatched lower-boundary L-shells of enhancements. Furthermore, Figure 11 shows in even more detail how closely the 1-day forecasted fluxes follow the targets over the combined validation and test period for selected L-shells. Here, flux curves from the same models as in Figure 7 as well as the linear2 model are plotted. The four PreMevE 2.0 model curves pack together tightly and trace the target curve (black) closely, particularly during the decays of high-intensity events. The closeness between the target and each forecasted curve reflects the performance of each model. A close inspection reveals that the linear2 curve (yellow) is often the one farthest from the target, appearing almost as an envelope of the predictions, while the LinearReg curve (green) is the closest tracer of the target at L = 4.5, and the FNN curve (red) is the winner at the other two L-shells. Nevertheless, the forecasted values often lag behind the target at the onsets of MeV electron events, for example, the ones on ~Day 988 and 1093 at L = 4.5. Figure 12 illustrates how well the onsets of MeV electron events at L = 4.5 are captured by the models. Here, forecasts from linear1 are also plotted in blue for comparison.
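The error quantity shown in the right-hand panels can be reproduced as a signed log ratio; assuming a base-10 logarithm and the sign convention implied by the color coding above (negative = overprediction), a minimal sketch with our own function name:

```python
import math

def ratio_error(target, prediction):
    """Signed log-ratio error: 0 for a perfect match, negative when the
    model overpredicts (prediction > target), positive when it
    underpredicts. Sign convention inferred from the text's color coding."""
    return math.log10(target / prediction)
```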
(Linear1, or Submodel 1, in Chen et al., 2019, is specifically designed to predict the onsets.) We selected 16 major events in which MeV electron fluxes increase by >~10 times, marked by the vertical gray boxes in Figure 12. Linear1 (the blue curve) successfully predicts the onsets of all major MeV electron events at this L-shell, as indicated by the leading edges of significant sudden flux increases falling within the 25-hr-wide boxes (also called prediction windows). In comparison, although the four models (particularly the LinearReg model in green) often predict onsets earlier than linear2, they successfully predict only eight of them (those marked with a green letter Y), fail on seven, and have one event barely making it inside the prediction window. In other words, 1-day forecasts from PreMevE 2.0 predict the onsets at L = 4.5 with a success rate below 50%, which is better than linear2 but far behind linear1.
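The onset-scoring procedure above can be sketched as follows, assuming a simplified event definition (a >10x flux jump between consecutive bins) and counting a predicted onset as a hit when it falls within the 25-hr prediction window of an observed onset. The function names and the matching rule are our own simplification of the visual event selection described in the text:

```python
def detect_onsets(times, fluxes, factor=10.0):
    """Mark times where flux rises by more than `factor` relative to the
    previous bin -- a simplified stand-in for the paper's event selection."""
    return [t1 for t1, f0, f1 in zip(times[1:], fluxes[:-1], fluxes[1:])
            if f0 > 0.0 and f1 / f0 > factor]

def onset_success_rate(observed_onsets, predicted_onsets, window=25.0):
    """Fraction of observed onsets matched by at least one predicted onset
    inside the prediction window (in hours)."""
    if not observed_onsets:
        return 0.0
    hits = sum(1 for t in observed_onsets
               if any(abs(p - t) <= window for p in predicted_onsets))
    return hits / len(observed_onsets)
```

With 16 observed events, 8 hits would yield exactly the sub-50% rate quoted for the 1-day forecasts.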
Two-day forecasts are also presented in Figures 13 and 14. Again, forecasted results at three L-shells in Figure 13 closely trace the target, similar to Figure 11. Interestingly, for all 16 selected major MeV electron events in Figure 14, the onsets of 11 events are successfully predicted by the four models at L = 4.5, while the number of failed events decreases to 4. This raises the success rate of onset prediction to ~70%. Judging from this number and the above results, this new PreMevE 2.0 model is able to combine the advantages of both linear1 and linear2 by not only predicting the arrivals of new MeV electrons but also closely specifying evolving flux levels, which is encouraging progress.
Results from the LinearReg and LSTM models at GEO are specifically presented in Figure 15 for both 1- and 2-day predictions. For 1-day forecasts in the top three panels, fluxes from LinearReg (green) and LSTM (purple) trace the observations (black) more closely than linear2 (yellow), consistent with their higher PE values shown in Table 2. Also, forecasts from LinearReg and LSTM appear to predict the onsets of MeV electron events at about the same level as linear1, judging by the leading edges of flux spikes in those curves. For 2-day forecasts, the LinearReg PE value in Table 3 suggests that 2-day forecasts from the LinearReg model are close to 1-day forecasts from linear2, which can be seen from the entangled LinearReg and linear2 curves in Panels D-F. Forecasts from the LSTM model are not as good, although they still capture the general trend of 1-MeV electrons at GEO.

[Figure 13 caption: Two-day forecasts are compared to target fluxes at three selected L-shells over the combined validation and test period. Same format as Figure 9. Note here linear2 is for 1-day forecasts instead of 2-day.]

Figure 16 shows the Spearman correlation between the input data (E2, E3, P6, and SW speed) and the target (1-MeV electrons) for three selected L-shells using different time lags. The Spearman correlation does not assume that the data follow a particular distribution; it is a non-parametric measure of monotonic relationship. The results in Figure 16 show that the Spearman correlation between input and target decays with longer time lags. The correlation remains stronger for longer periods at inner L-shells (i.e., longer memory) and decays faster at outer L-shells (shorter memory). We also note that the correlation between SW speeds and the target becomes more significant moving to outer L-shells, which is consistent with our discussion in section 4.1. Curiously, the shape of the correlation curve of E3 is similar to that of P6, whereas the shape of E2 is similar to that of SW.
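The lagged Spearman analysis of Figure 16 can be reproduced by ranking both series and taking the Pearson correlation of the ranks, with the target shifted by the lag. A self-contained sketch (function names are our own):

```python
def ranks(values):
    """Average ranks, 1-based; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of 1-based rank positions i+1..j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Non-parametric -- sensitive only to monotonic relationships."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

def lagged_spearman(driver, target, lag):
    """Correlate the driver series against the target shifted by `lag` bins."""
    return spearman(driver[:len(driver) - lag], target[lag:])
```

Scanning `lag` over a range of time bins and plotting the result per L-shell reproduces the kind of decay-with-lag curves described for Figure 16.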
All of these suggest that more robust models could be built with inputs (window sizes and parameter combinations) that vary by L-shell. In fact, Figure 5 shows that E2 + SW is apparently the best input combination for outer-shell prediction, whereas other input combinations give stronger PE values for inner shells. Given a threshold of ~0.4 for significant correlation, Figure 16 indicates that a fixed window size for all L-shells may range from >~14 (to include the maximum correlation values) up to ~20 (to avoid too long a history). Previously, Chen et al. (2019) used 300 time bins to train the linear1 and linear2 submodels to forecast MeV electrons. Table 1 shows that the linear1 and linear2 models have weaker forecasting performance than the LinearReg models trained with similar inputs. The difference in performance can be explained by the fact that a much larger training set incorporates a wider range of flux variations, which helps train the models.
In addition, the inclusion of SW speeds definitely helps improve the performance of linear models at large L-shells, which is consistent with the study of Baker et al. (2019) confirming the importance of solar wind speeds for MeV electron flux levels.
In this work, we have performed tests on the number of units, types of activation functions, and number of layers; however, it is still possible that more intensive testing of artificial NN architectures would find a more appropriate model for MeV electron forecasting. Some of the avenues not fully explored in this work include different regularizations, architectures meant to predict the entire range of L-shells at once (in contrast to the L-shell-by-L-shell forecasting presented here), and different architectures for different L-shells. Additionally, the cross-shell correlation is only marginally explored and has the potential to help achieve better performance. Moreover, we expect that more available data can help improve the models' performance. We plan to test with observations over longer periods, as well as for higher-energy electrons, in the next step.

Summary and Conclusions
This new PreMevE 2.0 model aims to forecast MeV electron distributions even with no in situ measurements available, for example, during the post-RBSP era, and it is designed to be driven by easily accessible inputs from long-standing satellite constellations in LEO and GEO as well as at the Lagrangian 1 point of the Sun-Earth system. Meanwhile, deep learning algorithms have recently achieved new state-of-the-art accuracy on many problems, partially owing to the increase in available observations. Therefore, it is reasonable to foresee an increase in both the performance and the use of deep learning models for MeV electron forecasting as more space weather data are accumulated and made available.
In this work, we have tested (1) different model input parameter combinations and (2) four categories of supervised machine learning algorithms, with the goal of upgrading our predictive model for MeV electrons inside the Earth's outer radiation belt. This new PreMevE 2.0 model has been demonstrated to make much improved forecasts, particularly at large L-shells, by adding upstream solar wind speeds to the model's inputs. Additionally, based on four categories of linear and artificial machine learning algorithms, a list of models was constructed, trained, validated, and tested with 42-month MeV electron observations from NASA's Van Allen Probes mission. Model predictions over the 14-month-long out-of-sample test show that, with optimized model hyperparameters and input parameter combinations, the top performer from each category has a similar capability of making reliable 1- and 2-day forecasts with L-shell-averaged PE values of ~0.87 and ~0.82, respectively. Interestingly, the linear regression model is often the most successful one compared to the other models, which suggests that the relationship between the dynamics of trapped 1-MeV electrons and precipitating electrons is dominated by linear components. It is also shown that PreMevE 2.0 can predict the onsets of MeV electron events in 2-day forecasts with a reasonable success rate of ~70%. This improved PreMevE model is driven by observations from existing space infrastructure (including a NOAA LEO satellite, the solar wind monitor at the L1 point, and one LANL GEO satellite) to make high-fidelity forecasts for MeV electrons and thus can be an invaluable space weather forecasting tool for the community. In the future, we plan to extend PreMevE to higher-energy electrons and also carry out more experiments, including exploiting the cross-L-shell correlation thoroughly.