Medium Energy Electron Flux in Earth's Outer Radiation Belt (MERLIN): A Machine Learning Model

The radiation belts of the Earth, filled with energetic electrons, comprise complex and dynamic systems that pose a significant threat to satellite operation. While various models of electron flux both for low and relativistic energies have been developed, the behavior of medium energy (120–600 keV) electrons, especially in the MEO region, remains poorly quantified. At these energies, electrons are driven by both convective and diffusive transport, and their prediction usually requires sophisticated 4D modeling codes. In this paper, we present an alternative approach using the Light Gradient Boosting (LightGBM) machine learning algorithm. The Medium Energy electRon fLux In Earth's outer radiatioN belt (MERLIN) model takes as input the satellite position, a combination of geomagnetic indices and solar wind parameters including the time history of velocity, and does not use persistence. MERLIN is trained on >15 years of the GPS electron flux data and tested on more than 1.5 years of measurements. Tenfold cross validation yields that the model predicts the MEO radiation environment well, both in terms of dynamics and amplitudes o f flux. Evaluation on the test set shows high correlation between the predicted and observed electron flux (0.8) and low values of absolute error. The MERLIN model can have wide space weather applications, providing information for the scientific community in the form of radiation belts reconstructions, as well as industry for satellite mission design, nowcast of the MEO environment, and surface charging analysis.


Introduction
The Van Allen radiation belts, discovered by the array of Explorer satellites (Van Allen & Frank, 1959), are zones of charged energetic particles, mainly electrons and protons, trapped by the magnetic field of the Earth. Energetic electrons (>100 keV) are mainly confined to two regions-the inner belt, within L from 1.2 to 2.5, and the outer belt located between L from ∼3 to 7 (Lyons et al., 1972;Summers et al., 2004). The inner and outer electron radiation belts are separated with a so-called slot region, usually devoid of energetic electrons (Kavanagh et al., 2018;Lyons & Thorne, 1973). The inner radiation belt is known to exhibit long-term stability, while the outer belt is highly dynamic and depends substantially on solar activity (Meredith et al., 2006).
The dynamics of the outer radiation belt is governed by a complex interplay between acceleration and loss processes (Reeves et al., 2003). Electrons with energies of tens of keV, called the source population, are injected into the inner magnetosphere during substorms and produce waves, for instance the whistler mode chorus (e.g., Boyd et al., 2014Boyd et al., , 2016Jaynes et al., 2015). Electrons with energies of hundreds of keV, called the seed population electrons, are also injected in the magnetosphere during substorm activity. These electron populations can accumulate at the surface of the spacecraft and lead to satellite loss due to the so-called surface charging effects (e.g., Garrett, 1981;Lanzerotti et al., 1998). Furthermore, the seed population electrons can be accelerated to relativistic energies by waves. These relativistic (>1 MeV) particles can penetrate through satellite shielding and damage the equipment onboard, also leading to satellite loss (e.g., Fennell et al., 2000).
To date, there are more than 2,200 operational satellites in the Earth's orbit, and many of them systematically pass through the radiation belts region. Approximately 1,400 spacecraft are in low Earth orbit (LEO) at altitudes up to 1,000 km. The LEO satellites cross the inner radiation belt in the region of the South Atlantic magnetic anomaly (SAA) and the outer belt at higher latitudes. The second most populated is the geostationary orbit (GEO) with more than 560 satellites flying at altitudes of ∼36,000 km synchronously with the rotation of the Earth. GEO satellites generally fly close to the outer edge of the outer radiation belt (L ∼ 6.6). Satellites flying below GEO and above LEO follow the so-called medium Earth orbit (MEO). Many Global Navigation Satellite System (GNSS) satellites fly at MEO, for instance, the Global Positioning System (GPS), Global Navigation Satellite System (GLONASS), and Galileo. Furthermore, in order to reach GEO, an increasing number of spacecraft are using the electric orbit raising method and can spend hundreds of days in the MEO region Horne & Pitchford, 2015). Satellites following MEO systematically pass through the heart of the outer radiation belt and are exposed to the largest values of electron flux. The number of satellites in the Earth's orbit will increase significantly in the following years, and in order to ensure the long-term satellite operation stability, it is necessary to have reliable models of electron intensities at different energies (from tens of keV up to several MeV) and locations.
The existing radiation belt models can be divided into three main categories: physics-based, data-driven, and data assimilation models. Several physics-based models of electron flux have been created for the radiation belts and ring current region. Among them, there are the Versatile Electron Radiation Belt (VERB) (e.g., Subbotin & Shprits, 2009), British Antarctic Survey Radiation Belts Model (BAS-RBM) (Glauert et al., 2014), and Dynamic Radiation Environment Assimilation Model (DREAM) (Reeves et al., 2012;Tu et al., 2013Tu et al., , 2014 codes based on solving the three-dimensional Fokker-Planck equation to reproduce the dynamics and variability of the MeV radiation belts electrons. The physics-based models typically include the radial diffusion, losses due to pitch-angle scattering, and magnetopause shadowing . Recently, the VERB-4D code has been developed to extend the VERB code to lower-energy ring current electrons by including advection terms (Aseev et al., 2016;Shprits et al., 2015). The physics-based Inner Magnetosphere Particle TrAnsport Model (IMPTAM) was developed and shown to give reasonable flux predictions at energies from several eV up to <150 keV (Ganushkina et al., 2019). The low energy electrons are also modeled by the coupled Fok ring current (FRC) (Fok & Moore, 1997) and comprehensive inner-magnetosphere ionosphere (CIMI) (Fok et al., 2001) models operating online. It should be noted that there is generally a gap in modeling the electron flux at energies 100-600 keV. Physical modeling at these energies is considered difficult due to the fact that electric field effects have to be considered (Ganushkina et al., 2011) and also because the physics governing the dynamics of electrons at medium energies is not entirely understood (Horne et al., 2013).
The data-driven models can be subdivided into static and dynamic. AE8 (Vampola, 1997) and AE9 (Ginet et al., 2013) are examples of static models providing the values of electron intensities from 40 keV up to ultra-relativistic energies. AE8 and AE9 models overcome limitations of the individual data sets by combining large scale statistics and are currently used as a reference for engineering purposes . Dynamic data-driven models typically depend on a combination of solar wind parameters and geomagnetic indices. Several data-driven models have been developed for the GEO orbit. Denton et al. (2015) developed an empirical model of electron flux for low energies (10 eV to 40 keV), driven by the Kp index, based on data from 82 LANL satellites. Later, the upstream solar wind conditions were incorporated into the model , and it was also expanded to 6-20 R E (Denton et al., 2019) using Cluster data. For relativistic energies, Balikhin et al. (2011) employed a nonlinear autoregressive moving average with exogenous inputs (NARMAX) technique to predict daily flux at GEO orbit at energies 800 keV and 2 MeV, using solarwind parameters and the previous time-series of daily flux from GOES satellites. The NARMAX model was further extended to a broader energy range, including electron flux at energies of hundreds of keV (Boynton et al., 2016). However, the model needs satellite time-series as inputs and therefore for now is confined to the geostationary orbit. Several models of the relativistic electron flux are based on the artificial neural networks (ANN), for example, Ling et al. (2010), Kitamura et al. (2011), and de Lima et al. (2020. Other empirical models of relativistic flux developed for the GEO region include the relativistic electron flux model (REFM) driven by solar wind velocity (Baker et al., 1990), an empirical function D 0 dependent on several solar wind parameters and Kp (Li, 2004), and linear regression models (e.g., Osthus et al., 2014;Sakaguchi et al., 2015;Simms et al., 2014Simms et al., , 2016 which take as inputs the solar wind parameters and previous values of flux at GEO. Tsutai et al. (1999) used linear filter to predict the values of >2 MeV flux at GEO 1 day ahead using GOES magnetic field data over the preceding 6 days.
Although a variety of models have been developed for the geostationary orbit, few data driven models exist that give reliable electron flux predictions at MEO. One of the recently developed models of electron flux at MEO working with a daily cadence is the SHELLS model (Claudepierre & O'Brien, 2020) which takes as input the LEO electron flux values and Kp index and returns the flux along Van Allen Probes orbit at energies 0.35-1 MeV. The general lack of models at MEO comes from the fact that many GEO satellites provide continuous high quality observations of electron flux, while at MEO, the temporal and spatial coverage of observations remains rather sparse (Sakaguchi et al., 2015). Indeed, only ∼100 satellites reside in MEO, and only few of them provide measurements of the radiation belt populations. Among other data sets of electron flux measurements in the MEO region, the recently released GPS energetic particle data have notable advantages such as for instance, the large number of satellites (23) and uniform MLT coverage, as well as availability of 18 years of observations covering almost two solar cycles. Furthermore, most of the GPS satellites carry onboard identical Combined X-ray Dosimeter (CXD) detectors measuring electron flux at energies 0.12-10 MeV. The GPS/CXD data have been inter-calibrated with Van Allen Probes electron flux measurements, and the two missions were in good agreement at energies below 4 MeV (Morley et al., 2016).
In the current paper, we present the data-driven Medium Energy electRon fLux In Earth's outer radiatioN belt (MERLIN) model, based on machine learning. For model training, we employ the Light Gradient Boosting Machine (LightGBM) algorithm, which is known for its high efficiency and accuracy (Ke et al., 2017). The model takes as input satellite position in an Lshell -MLTmagnetic latitude (MLat) frame, solar wind parameters with history of velocity, and geomagnetic indices. The model returns values of locally omnidirectional electron flux averaged over the viewing angle of the instrument at energies 120-600 keV as outputs. The structure of the paper takes the form of five parts, including this introductory section. Section 2 describes the data set used for model construction. Section 3 is concerned with the methodology used for this study. Section 4 presents the results. The conclusions are drawn in the final section.

GPS Electron Flux Data
The GPS spacecraft are distributed across six orbital planes, nominally inclined at 55°. The satellites follow near-circular medium Earth orbit, with 12 hr revolution period, at a constant altitude. As of 2020, the constellation consists of 74 spacecraft, of which 31 are operational, nine reserve, two being tested, and 32 no longer in use (www.gps.gov). Due to its fixed altitude of 20,200 km (R ∼4.2), the GPS constellation travels through a range of L-shells providing the particle measurements in the outer radiation belt. We note, however, that the inclination of the GPS orbit restricts the range of equatorial pitch angles as a function of L-shell. The satellite at off-equatorial magnetic latitudes (MLat) cannot observe the particles mirroring at lower MLats and therefore samples only a part of the equatorial pitch angle distribution.
Since the early 2000s, newer GPS satellites are equipped with either of the two instrument series: the improved BDD-IIR or the CXD. Most of the satellites currently carry aboard the identical CXD detectors. Their response is well known and their electron flux data have been used in previous radiation belt studies 10.1029/2020SW002532

Space Weather
SMIRNOV ET AL.
(e.g., Olifer et al., 2018;Pinto et al., 2020). The CXD instruments measure the electron flux using two sensors: the low energy particle (LEP) and the high energy particle (HEP) sensors, with the typical sampling rate of 240 s (Morley et al., 2016). In the present study, we use data of the first five evaluated CXD energies, namely, 120, 210, 300, 425, and 600 keV. As of 2020, 21 GPS satellites are equipped with the CXD detectors, providing more than 200 years of satellite data. The CXD measurements were previously cross-calibrated with the Van Allen Probes electron flux and showed good agreement. Morley et al. (2016) analyzed the response functions for different GPS satellites and found a small systematic difference between satellites up to ns-62 and satellites with designators greater than ns-62. To minimize these differences and cross-calibrate against the Van Allen Probes ECT suite in 2012-2014, the GPS instrument response functions were updated. The publicly released data product uses the updated response functions. The GPS/CXD data showed a good agreement with Van Allen Probes for energies of up to 4 MeV, while at higher energies larger variance was observed due to the unaccounted instrumental backgrounds.
The present study is based on 17 years of GPS/CXD electron flux measurements at energies 120-600 keV.
The flux values were first cleaned using the "dropped_data" flag, and outliers in the data were removed by setting the minimum allowed flux values to 1 cm −2 keV −1 s −1 sr −1 . Furthermore, in order to remove the unrealistic values coming from the forward flux retrieval procedure, we use a new fit quality flag defined as F ¼ maxðjlog 10 predictedcounts observedcounts jÞ; (1) where predicted and observed are arrays of the counts for the first five energies ("model_counts_elec-tron_fit" and "rate_electron_measured" products in the data files, respectively). The flag values corresponding to low quality fits were selected to be F > 0.11, and such data entries were removed from the data set.
We use data from 20 satellites (ns53-ns59 and ns61-ns73) carrying the CXD detectors (for details, see Carver et al., 2018). It is of note that the data from ns60 were not used due to the intermittent quality issues. We train the model on data with the original cadence of measurements equal 240 s. Before fitting the model, we applied the base 10 logarithm to the GPS electron flux, as the data variance can be up to several orders of magnitude.

Solar Wind and Geomagnetic Indices
The relationship between the electron flux intensities in the outer radiation belt and solar wind parameters has long been recognized (e.g., Paulikas & Blake, 1979;Reeves et al., 2011). Numerous studies have analyzed contributions of the solar wind parameters to flux enhancements. The independent contributions of solar wind velocity and number density were investigated, for instance, in Balikhin et al. (2011), Kellerman and Shprits (2012), and Simms et al. (2014). A combination of velocity and density, pressure and geomagnetic indices, combined with the previous daily flux value, was examined by Sakaguchi et al. (2015). Long-term relationship between velocities and MeV electron fluxes was discussed in Reeves et al. (2013).
It has been well established that the radiation belt flux enhancements are connected with changes in solar wind speed. Reeves et al. (2011) analyzed the relativistic electron flux at GEO with respect to the solar wind speed and noted that the resulting distribution resembled a triangle. Such a shape was explained as follows.
The V sw values rarely fall below 300 km/s, and this leads to a left-hand side of the triangle (see also Wing et al., 2016). The top side of the triangle forms due to the fact that the flux values seem to have a sharp maximum at higher V sw , for which multiple explanations have been given. One of the most puzzling features of the triangular distribution is that the variability of electron flux at lower V sw is much larger than at higher V sw . Reeves et al. (2011) noted that the electron flux can exhibit large values under any V sw values. The triangular form demonstrates that using the values of the solar wind velocity and density is not enough to fully explain the variability of flux, and therefore, other parameters have to be taken into account.
We consider the following solar wind parameters and geomagnetic indices obtained at the OMNIWeb database (omniweb.gsfc.nasa.gov). First, among the solar wind drivers, we analyze the solar wind velocity, and its components V x , V y , and V z . We analyze the IMF magnitude, B x , B y , and B z components, and also solar wind density n sw . We employ the derived solar wind quantities: magnetosonic and Alfvenic Mach numbers (Mach_a and Mach_m, respectively), solar wind temperature T sw , electric field (v*B z ), dynamic pressure 10.1029/2020SW002532

Space Weather
SMIRNOV ET AL.
(Pdyn), and plasma Beta. From geomagnetic indices, we select SYM-H, SYM-D, ASYM-H, and ASYM-D indices, planetary Kp index and auroral AL, AU, and AE indices. It has been previously established that many of these features are, in fact, correlated with one another. In Figure 1, we show the Pearson linear correlations between different solar wind and geomagnetic parameters in order to examine which features can be used for the model setup.
We find that V x is perfectly anti-correlated with V, with the −1.0 coefficient, which is as expected since V x constitutes most of the V amplitude. V y correlates with V with only 0.19 correlation, and V z shows zero correlation with velocity magnitude. SYM-D does not correlate with any of the features, except for very weak (0.14) relationship with SYM-H. SYM-H, on the other hand, correlates with several parameters. For instance, it exhibits a moderate positive correlation with the solar wind velocity (0.43) and negative correlation with ASYM-H (−0.6) and ASYM-D (−0.47). Furthermore, it correlates with auroral geomagnetic indices with approximately 0.5 correlation coefficient and is also weakly anti-correlated with the electric field (v*B z ). ASYM-D index is correlated with ASYM-H with the R value of 0.65, and also with auroral indices with the absolute value of the linear correlation ∼0.6. In turn, ASYM-H index shows weak linear correlation with solar wind velocity (0.32), IMF (0.49), and dynamic pressure (0.36) and exhibits higher correlation with the auroral indices with the corresponding R values of up to 0.7. Kp index exhibits moderate positive correlation with the IMF magnitude (R = 0.56), solar wind velocity (0.59), temperature and dynamic pressure (R ∼ 0.5), stronger positive correlation with AE (R = 0.7), and the corresponding AU and AL indices, along with 0.5 correlation with SYM-H. It should be noted that the Kp index has a 3-hr cadence and therefore shows lower correlation with AE than one would expect. By averaging the AE index to the same 3-hr cadence, one obtains a correlation of 0.82. IMF magnitude shows weak correlation (0.3) with solar wind density and temperature, moderate (0.5) correlation with dynamic pressure and Mach numbers.
We note that although the gradient boosting regression trees are prone to the multi-collinearity of features (e.g., Ding et al., 2016;Maloney et al., 2012), using highly correlated inputs can pose a disadvantage for machine learning studies. For example, when two parameters are correlated, we can achieve the same reduction in variance as by using only one of them. Here we remove several correlated and derived quantities leaving the more in-depth analysis of the influence that different parameters have on the electron flux for further studies. First, we exclude directional components of magnetic field and velocity. Furthermore, we exclude all of the derived quantities, because they encompass information of their original constituent variables (for instance, dynamic pressure strongly correlates with density). Magnetosonic and Alfvenic Mach numbers essentially represent the normalized velocity, and it is enough to consider the velocity itself. We apply the same reasoning to the geomagnetic indices selection: AE is a product of AL and AU indices and also correlates with them, which is why we only use AE for model setup. AE and Kp indices are generally strongly correlated. While Kp is a measure of the planetary geomagnetic activity, substorms are better resolved by the AE index. Furthermore, Smirnov et al. (2019) reported the long-term positive correlation of electron flux at energies up to 400 keV with AE index along the solar cycles 23 and 24. The same conclusion was drawn in Smirnov et al. (2020) by analyzing the long-term phase space density (PSD) variations of low-μ electrons, indicating the importance of the substorm activity for radiation belts transport processes. For these reasons, we include both Kp and AE indices as inputs.
The inner edge of the outer belt is highly dynamic and can move inwards during slot-filling events and outwards during the quiet periods. Li et al. (2006) reported a correlation between the 30-day averages of the innermost edge of the outer belt and the plasmapause location (Lpp) using 12 years of SAMPEX data. The flux values in the slot region, located below the Lpp, are lower than those beyond the plasmapause due to loss processes attributed to storm-enhanced EMIC and plasmaspheric hiss waves (Li et al., 2006). O'Brien and Moldwin (2003) presented a model of the plasmapause location parametrized as a function of the maximum AE value over the preceding 36 hr. Furthermore, the Lpp model based on the AE index was found to perform better than that using the Dst values. In order to account for the dynamics of the inner edge of the outer belt, we include the maximum value of the AE over 36 hr as an input parameter. We do not apply the linear regression coefficients to convert the max(AE) to Lpp, as the regression trees are invariant to linear scaling operations (e.g., Druzhkov et al., 2011).
After the enhancement events, the flux of medium energy electrons decay to their pre-storm values gradually over a period of up to 20 days (e.g., Meredith et al., 2006). Such a slow decay can be explained by the longer hiss lifetimes, which by different estimates vary from several up to tens of days (Orlova et al., 2016). Hence, it is crucial to include some indication of the previous state of the radiation belts into the model. This is usually done by adding the preceding values of flux as model inputs (e.g., Boynton et al., 2016;Simms et al., 2016Simms et al., , 2014. Instead, in the MERLIN model, we include the history of solar wind velocity as a proxy of the previous activity. We use the averaged v sw values over the preceding 1, 2, 3, 6,9,12,15,18,21,24,30,36, and 42 hr and 2, 3, 7, and 14 days. It should be noted that the averages over longer periods of time also carry part of the information from shorter scales averages. In this study, we only consider the history of solar wind velocity, while adding the history of number density leads to overfitting, as discussed in Section 4.1. In sum, as input parameters, we select the satellite position in the L-MLT-magnetic latitude frame, geomagnetic indices SYM-H, Kp, and AE, solar wind parametersnumber density, electric field (v*B z ), and velocity with 2 weeks of history (see also Figures 2 and 4). L-shell values were calculated using T-89 model with the internal field given by IGRF. More features can be incorporated into the model in the future; however, preserving the methodology and the more refined feature selection will be performed in a separate study.

Methodology
Large and growing volumes of data have been provided by the satellite missions in the Earth's radiation belts region. One of the efficient ways to utilize these long-term data sets for modeling is to apply machine learning (ML) techniques. Over the years, machine learning has found numerous applications in the field of space physics research. ML methods have been employed for the geomagnetic indices forecast (e.g., Bala

Space Weather
SMIRNOV ET AL. (Bortnik et al., 2018;Zhelavskaya et al., 2017), and solar activity prediction (e.g., Colak & Qahwaji, 2009). Prediction of the electron flux in the outer radiation belt remains one of the most challenging tasks in the space weather research (Camporeale, 2019). In the present study, we use the Light Gradient Boosting approach, described in detail below, to predict the flux of medium energy electrons using the GPS particle data.

LightGBM
One of the predictive approaches widely used in machine learning is the so-called gradient boosting decision tree (GBDT) method. The GBDT algorithms gained popularity for being efficient, highly accurate, and interpretable. GBDT is an ensemble model of usually shallow decision trees, also called weak learners, trained in sequence (Friedman, 2001). Growing each individual tree starts from the source set contained in a root node of the tree (shown in Figure 2). When a split is made, the root node is divided into two subsets, and two branches are generated (Ren et al., 2019). The procedure is repeated recursively until either the subset at each node contains all identical values of the target variable, or when the splitting is constrained by the algorithm's hyperparameters (e.g., max_depthor num_leaves is reached). The GBDTs are grown iteratively, and each new tree fits the residuals of the previous iteration to account for the mis-modeled instances (Freund et al., 1999). GBDTs have been applied successfully to many machine learning problems, performing well for regression and classification tasks alike. Numerous GBDT implementations have been developed, starting from adaptive boosting (AdaBoost- Freund et al., 1999). One of the most popular gradient boosting methods up to date is the Extreme Gradient Boosting Machine (XGBoost) (Chen & Guestrin, 2016), famous for winning machine learning competitions and outperforming even the deep learning neural network (NN) models on tabular data. Even though the gradient boosting methods are capable of giving high quality predictions, their main limitation is the unsatisfactorily long training time and poor scalability (e.g., Zhang et al., 2019).
The main cost in GBDT lies in learning the decision trees, and the most time-consuming part in learning each tree is finding the optimal segmentation points. In the conventional algorithms, this is usually done using the so-called pre-sorted algorithm. This method enumerates all possible split points on the pre-sorted feature values. While being simple and effective, this method is also known to be inefficient in training speed and memory consumption (Ke et al., 2017). Another more recent method is the histogram-based approach. It divides the continuous features into k intervals and selects the split points from those k values (Ju et al., 2019). Such an approach also has regularization effect and helps prevent overfitting.
One of the GBDT implementations, called LightGBM, has been recently developed by Microsoft (Ke et al., 2017). LightGBM addresses the traditional GBDT performance issues by first, using the histogram approach to find segmentation points and second, by utilizing a different approach to the tree growth. The conventional gradient boosting and also other tree-based methods such as random forest grow the trees level-wise. This means that when it is necessary to make a new split, a new level of leaves will be grown. In contrast, the LightGBM method grows the trees leaf-wise which adds only one more leaf and not level when a split is made. Such an approach leads to a much faster and less computationally expensive implementation of the gradient boosting (Ke et al., 2017). It has been demonstrated that LightGBM can be as much as 20 times faster than XGBoost, while reducing more loss.
The objective of this study is to predict values of electron flux at a range of L-shells throughout the outer radiation belt. Since the quantity being modeled can be represented by real numbers, we construct a regression gradient boosting model. The model set up is described in detail in section 3.3.

Test-Train Splitting of the Data
Any supervised machine learning model learns on the so-called training data set. The training set is seen by the model and usually contains most of the employed data points. Machine learning algorithms are trained iteratively, trying to reduce the cost function value at each training iteration (Camporeale, 2019). At some point, the model learns not only the useful dependencies from the data but also the noise due to reducing the cost function to extremely low values, which results in over-fitting. The performance of the model cannot be adequately evaluated on the data that were used to train it. Therefore, another set is needed to give an unbiased estimate of model performance and also for tuning the hyperparameters. This second set is called a validation set. Since these data had not been seen during training, the loss function value would decrease only in the case when the model captured the underlying general dependencies from the data. However, it has to be noted that since the loss function is being routinely evaluated on the validation set, the model occasionally sees it as well, although never learns from it. Therefore, it is essential that after training the model and checking its performance on the validation set, the model be evaluated one last time on the fraction of data that had never been used before, neither for training nor validation. The data set used for this purpose is called a test set. In many machine learning competitions (e.g., Kaggle), only the training and validation sets are given to the competing teams, while the test set is released after the model submissions and decides the winner.
Multiple ways have been proposed to separate the data set into the train, validation, and test parts. In fact, very different strategies can be applied based on the field and objective of the study. For instance, in social sciences, the accepted methodology would be to use the random test-train split, that is, when points for train, validation, and test sets are selected randomly from the original full data set. This, however, should not be applied for modeling the time-series and physical processes, due to the fact that the validation points (usually 10-20% of values) can then be obtained by linearly interpolating the typically larger training set. Such a selection technique might introduce what is referred to as the data leakage (e.g., Camporeale, 2019). Hence, it is important to validate and then test the model on the eventsunseen during training. For that purpose, we should select the consecutive time intervals for validation and testing. One of the ways to achieve this is to use, for example, first 80% of data for training, the next 10% for validation, and then test the model on the last 10% of the data. However, in case of magnetospheric phenomena, we have to take into account the solar cycle evolution of the processes we model. Indeed, the radiation belt dynamics during the descending and quiet phases of the solar cycle are vastly different. Therefore, the model will be trained on some part of the solar cycle and validated on another, which would not yield an adequate estimate of its performance. In order to take the solar cycle dependence of the radiation belt dynamics into account, we perform a tenfold cross validation (CV), described below.
We first reserve >1.5 years of data (March 2016 to January 2018) for testing the model. This is done due to the fact that on one hand, this interval had numerous solar wind enhancement events and also had enough data points (∼1.100.000) for such evaluation. The entire data set consists of ∼5.5 million data points, and therefore, approximately 20% are reserved for test set.
The rest of the data set is used for the K-fold CV, illustrated in Figure 3. The data are divided into K roughly equal parts (in our case, K = 10). At every such split, the model is fitted on the (K-1) parts and validated on one part that was left out. For instance, during the first split, we use the first fold for validation and fit the model on the rest of the data. This splitting process is repeated K times, each time selecting a different interval for validation. At each split, the previous model is being discarded and a new one is being trained and validated on the newly selected data. The K-fold allows one to utilize all observations for training and evaluating the model, and each of the data points is used for validation only once. The K-fold CV is used to 10.1029/2020SW002532

Space Weather
SMIRNOV ET AL.
optimize the hyperparameters and also to retrieve the accuracy of the model, averaged over different phases of the solar cycle. The final model is then trained on the entirety of the training and validation sets.

Model Setup
In this study, we use the LightGBMRegressor method as implemented in the Python version of lightgbm library (Ke et al., 2017).
LightGBM has a variety of algorithm parameters, also called hyperparameters, that can have a non-negligible effect on the model performance. Several of them have to remain fixed, while others need to be optimized to achieve higher accuracy. Some of the parameters that do not change include the objective (in our case, regression), the booster method (we use gbdt, although we note that other possibilities are available using the novel DART and GOSS methods; Ke et al., 2017), metrics of the loss function values (we select both L1 and L2 metrics, representing the MSE and MAE, respectively), and the early_stopping_rounds parameter which is used to stop the model training once overfitting occurs. The learning_rate controls how much the model is adjusted on each iteration. In case of the high learning rate, the algorithm makes faster fits that can cause overfitting, while in case of extremely low values, the training speed drops sufficiently and more iterations are needed to reach convergence. We select the learning_rate of 0.05 and keep it fixed throughout further tuning. Other parameters need to be optimized. The most sensitive hyperparameter is the maximum number of leaves in a tree, or num_leaves. Deeper trees have better learning capabilities, but since the gradient boosting model represents an ensemble of weak learners, the num_leaves is usually not very high, in our case ranging from 15 to 250. Another important parameter is the minimum number of data points in leaf (min_data_in_leaf), which has a regularization effect and stops the model from learning the noise. However, the large values can cause decrease in model accuracy. One can also use the subset of the input features for training each individual tree by setting the colsample_by_tree value. To optimize the said hyperparameters we use the hyperoptPython library (Bergstra et al., 2013) which employs the Tree of Parzen Estimator (TPE) approach. The resulting hyperparameters values, as well as their search domains, are given in Table 1.

Feature Importance
One of the advantages of the tree-based machine learning methods, LightGBM included, lies in the fact that the resulting model can be easily interpreted. Every tree comprising the model can be visualized and can give ... Figure 3. Schematics of the K-fold cross validation (CV). The last 1.5 years of data are reserved as the test set, never to be used during training and validation. The rest of the data are then divided into K equal parts (here, K = 10) and at every split the model is trained on nine parts and validated on one part. The training process is repeated 10 times, each time withholding a different set for validation. Thus, the model uses all observations for training and validation, and each of the data points is used for validation only once. The final evaluation is performed on the test set.

Space Weather
SMIRNOV ET AL. direct information about the individual split gains, internal values in each leaf, and the decision making process in general. In practice, it is often not possible to analyze the model in this way due to the large number of trees in an ensemble (in our case, ∼300). The insight on the model construction can then be obtained indirectly, for example, by analyzing the importance scores of each variable, also called feature importance. They are computed for each GBDT and then averaged across all of the trees forming the model. There are multiple ways to retrieve the feature importance. LightGBM utilizes the so-called split or gain methods. The split method counts the number of times each variable was used to make a split. The gain method summarizes all gains of splits which use each of the features. It has been well established that the two methods can, in fact, give different results, and also that feature importances estimated this way only carry information about how the particular model was constructed, rather than physical meaning. Furthermore, removing one of the features can potentially redistribute its feature importance between several other variables and yield a different result altogether. Other methods, which are more stable, include the mutual information criterion, permutation method (Auret & Aldrich, 2011), and the recently developed Shapley values technique (Lundberg & Lee, 2017). These methods require an in-depth analysis which is beyond the scope of the present paper and will be evaluated in future studies. In the present section, we confine to describing of the key attributes of the MERLIN model. Feature importance estimated using the split and gain methods are shown in Figure 4. MERLIN uses the satellite position (L-shell, magnetic latitude, MLT), as well as values of the SYM-H, Kp, and AE indices and solar wind density (n sw ), IMF, electric field (El_field), and solar wind velocity. Furthermore, the time history of velocity is incorporated into the model in the form of averages over the certain time intervals. We use the progressively increasing time steps, from 1 hr (1h) up to 3 days (3d) and also the averages over 1 and 2 weeks (2w). Note. The first five parameters were kept fixed, while the next ones were optimized using HyperOpt.

Space Weather
We find that the most important features by gain are the L-shell and maximum of AE over 36 hr, which is a proxy of the plasmapause location ( Figure 4). Indeed, the values of electron flux depend on these quantities to a large extent, due to the fact that the electron intensities are higher in the heart of the outer belt and then decrease to the outer edge of the belt and also in the opposite direction towards the slot region, which is reflected in the L-values. The importance of the plasmapause location has been discussed in Li et al. (2006), and it was shown that Lpp correlated with the inner edge of the outer belt. The flux values drop significantly below the Lpp, although it is of note that the relationship between the innermost location of the outer belt and Lpp is energy dependent (e.g., Reeves et al., 2016;Ripoll et al., 2016). Among the instantaneous values of solar wind and geomagnetic indices, the SYM-H index, which is a proxy of geomagnetic storms, shows the most importance in both split and gain methods. We also note that the time history of solar wind velocity plays an important role. For the 210 keV electrons, the average velocity over 6 hr and also over 1 and 2 weeks have the most importance based on impurity reduction (Figure 4). The importance of the 2 weeks average of v sw likely comes from the fact that following the flux enhancement events, velocity drops to quiet time values faster than the electron intensities which remain at elevated levels for longer periods of time. Using CRRES data, Meredith et al. (2006) demonstrated that the flux values elevated by the substorm or storm processes decay to their pre-storm values gradually and that the flux can fall by two orders of magnitude over a period of approximately 20 days. Therefore, an indicator of the past events is needed in order to correctly reproduce the dynamics of the flux decay. In the MERLIN model, we do not use any previous values of flux, and hence, it is the history of velocity that the model uses as an indicator. We note that AE and Kp indices exhibit very close feature importance values, because being highly correlated with one another, they produce equal reduction in variance when making splits. Figure 5 shows the influence of the solar wind history on the model performance on the example of 425 keV electrons. The MSE of the model with no solar wind history employed is shown as black dots. We further add the history of solar wind velocity of up to 14 days with gradually increasing time steps. The MSE decreases as more history is included, both on training and validation data. On the other hand, including also the time history of solar wind density decreases the training but not the validation error. In Figure 5, it can be seen that while the training MSE is lower at every time step when density is included, the validation MSE changes very slightly as compared to using velocity history alone and remains within the error bar, indicating overfitting. Furthermore, it was observed that adding averages of solar wind velocity of more than 2 weeks also results in overfitting. Therefore, we only use the history of velocity of up to 14 days as inputs.

Results of Tenfold Cross Validation
Multiple metrics can be used to evaluate the model accuracy (for details, see Liemohn et al., 2018;Morley et al., 2018). The LightGBM library offers a variety of metrics implemented for model analysis. The default metric is the mean squared error (MSE), which is computed at each training iteration. It should be noted that the electron flux can exhibit strong depletions, up to several orders of magnitude, over short times (Ganushkina et al., 2019). The mean squared error is susceptible to outliers, and therefore, we also evaluate the median of the squared error. In Table 2, it can be seen that both for training and validation sets, median of the squared error is approximately three to four times lower than the MSE. This means that while some of the rapid depletions/enhancements are not completely reproduced, the value of the median squared error of 0.06 shows that the model predictions are very close to the observed data.
Another useful metric, implemented in the LightGBM library, is the median absolute error, denoted as MAE. Evaluating the median error allows one to pay less attention to the outliers but gives a good proxy of model performance otherwise. In our case, MAE values are close for the train (0.19) and validation sets (0.26). These values of MAE mean that in general, the predicted values differ from observations by a factor of ∼1.5, which is considered acceptable for radiation belt modeling. The scale-dependent metrics can be difficult to interpret, since the model predicts base 10 logarithms of flux and also because the level of flux is generally different for different energies. We evaluate the normalized root mean squared deviation (NRMSD), defined in Equation 1 of Denton et al. (2019). The zero value of NRMSD corresponds to the perfect prediction, and the values of <1 generally indicate a good match between the model and data (Denton et al., 2019). From Table 2, one can see that the values of NRMSD are close to 0 (0.04 for train and 0.08 for the validation set), indicating that the model reproduces the observations very well.

Space Weather
The metrics mentioned above mainly quantify how far the predictions deviate from observations at stationary points. In order to evaluate how well the model reproduces the dynamical behavior of the electron flux, we employ the following metrics. First, we evaluate the correlation between the predictions and observations. The standard Pearson linear correlation coefficient is susceptible to outliers, and therefore, we use the Spearman rank correlation coefficient (ρ). From Table 2, it can be seen that the correlations are high both for train and validation sets, based on the tenfold CV. The average correlation on the train set is 0.89 and is slightly higher than that on the validation (0.85), which is as expected since the model was fitted on the train data. The high values of the correlation coefficient show that the model captures the behavior of flux at energies 120-600 keV, which are known to be very dynamic and can exhibit drastic changes on the order of several minutes.
Another popular metric widely used in machine learning is the r 2 score. This indicator is used to quantify the fraction of variance explained by the model (e.g., Morley et al., 2018). The average r 2 for the train data is 0.84 and 0.63 for the validation set. While these values appear low, we conduct a more detailed analysis of the r 2 on the synthetic data, shown in supporting information ( Figure S1). We first generate a harmonic function, a sinusoid, for 20 periods of 2π each, and then add random noise of maximum magnitude equal to 0.2. The electron flux are known to exhibit rapid depletions of up to several orders of magnitude, and to account for this, we add 60 dropouts where we subtract 2 units from the synthetic signal. We compute the values of the r 2 for the case with and without outliers. The initial sinusoid signal compared to the data with no outliers yields prediction efficiency of ∼0.9, while the data in presence of outliers give a lower r 2 value of ∼0.6, which is approximately equal to the value we have for MERLIN on the validation set. This means that the model can adequately reproduce the behavior of flux but miss some of the dropout magnitudes, Note. The standard error for the 10-fold CV are shown in brackets. 10.1029/2020SW002532

Space Weather
which results in the lower prediction efficiency. This will be further discussed while analyzing the performance on the test data. Figure 6 demonstrates the model performance, averaged over the tenfold CV, for individual energy channels. Figure 6a shows the median absolute error. It can be seen that the best performance is achieved for 425 keV electrons, although we note that the MAE for other channels is only slightly different. We find that the accuracy improves with higher energies (MAE for 120 keV on the validation set is 0.36 compared with ∼0.26 for 425 keV). The accuracy then slightly decreases from 425 to 600 keV electrons. The same can be seen in terms of the correlation (Figure 6b)-the averaged correlation on the validation sets is ∼0.80 for 120 keV and ∼0.85 for 300 keV. The ρ value then slightly decreases to 0.84 for 600 keV. The difference between Spearman correlation coefficient is generally very small between the energy channels (on the order of a few percent). It should be noted that the values of MAE and the correlation for each individual channel show that the MERLIN model well predicts both amplitudes and flux dynamics throughout the considered energy range.

Performance on Test Data
The train set is used to learn the model, and the model is constructed so as to minimize the error on the validation set. On each iteration, the MSE and MAE values should decrease for the train set as the gradient boosting model adjusts the residuals to fit the train set better, and then the updated model is evaluated on the validation set. If the validation error reduces, the training continues to a new iteration (i.e., grows a new GBDT). The model never learns from the validation set; however, it is still being used on every training iteration. We therefore need another totally independent set of values to check that the model generalizes well onto the unseen data. It is of note that this test set can only be used once to evaluate the performance, after the model training has been completed, and no further changes to the model can be made. More details on the train-validation-test splitting of the data can be found in section 3.1 and in Figure 3.
The values of different metrics discussed above are given in This means that the model successfully learnt the underlying relationships between the input parameters and the resulting electron flux on the training data, yields good accuracy on the validation intervals, and generalizes well onto the unseen data.  Figure 7b gives the flux values from the MERLIN model, and the difference between the predicted and observed flux is shown in Figure 7c. Figure 7e demonstrates the solar wind velocity for the test set. It should be noted that numerous solar wind velocity enhancements happened during the test interval with v sw rising up to >700 km/s. These events generally correspond to increases in electron flux, but it can be seen that after the velocity drops to its quiet time values, the flux remains elevated for longer periods of time. For instance, v sw increased up to 700 km/s at the beginning of September 2016 (detailed illustration is in Figure S2) and caused a rapid increase in electron flux by over an order of magnitude. The velocity then started decreasing and within 1 week dropped to 500 km/s while the flux remained at the elevated levels. Within a few days, velocity continued decreasing until it reached <300 km/s, but the electron flux had a much longer decay, and even after ∼1 week of very low velocities, the flux did not fall to the pre-event values. 10.1029/2020SW002532

Space Weather
In general, the model adequately reproduces all of the major flux enhancement events and also reproduces well the flux decay due to consideration of the velocity history. We note that for L-shells lower than 5, the flux enhancements are sometimes followed by periods of the increased data variance (Figure 7a). The most likely explanation lies in the fact that the GPS electron flux data are a derived quantity. As such, the fluxes are computed using a forward model combining three relativistic Maxwellians (in energy) and a Gaussian(in log of momentum) (see for details Morley et al., 2016). The electron energy spectra inside the plasmapause can have local peaks, and intense plasmaspheric hiss can generate a reverse spectrum in the energy range of hundreds of keV (Zhao et al., 2019). At lower energies, spectra are not always well captured due to the limitations of the spectral form (see section 3.2 of Morley et al., 2016), while this effect is not observed at higher energies. As expected, such events are not reproduced by the MERLIN model.
The Spearman correlation between the observed and predicted flux is approximately 0.81. It is, however, important that the model reproduces not only the flux along the GPS orbits but is also capable of giving reasonable prediction at fixed L-values. Figure 7d shows the GPS observations and the MERLIN output for L of 5.2. The correlation between the two is 0.76 and the mean squared error is 0.06. One can therefore conclude

10.1029/2020SW002532
Space Weather that the model generalizes well on the unseen data, both at a fixed L and along the orbit. The same conclusion can be drawn for the 600 keV population, demonstrated in Figure 8.
Figures 9a-9e show the occurrence density plots of the observed versus predicted flux for all five energies considered. Solid white lines show the one-to-one relationship between observations and predictions, and the dashed likes represent the flux deviating by a factor of 5. In general, the occurrence maxima follow the trend, and most of the points are within the factor of 5 from the trend line. Figure 9f provides an example of the histogram of model residuals for the 210 keV population. From the figure, it is evident that the model has very low bias of ∼0.05, depicted by a green line. Furthermore, the errors are normally distributed and the 10th and 90th percentiles are approximately ±0.51 and 0.52, respectively, meaning that 80% of the model residuals are within a range of ±0.5. Therefore, we can conclude that MERLIN predicts the electron flux in 120-600 keV energy range well, has low bias, and captures all of the general trends represented in the data.

Conclusions
A new data-driven model of the electron flux in the outer radiation belt is presented. The model uses satellite position and a combination of geomagnetic indices and solar wind parameters (with time history of velocity) in order to predict the flux values at energies 120-600 keV. The model has been trained and validated on more than 15 years of GPS electron flux data and tested on >1.5 years of observations. The tenfold cross validation shows that the MERLIN model predicts the MEO radiation environment well, capturing both the dynamics and amplitudes of electron flux. The results of the tenfold cross validation agree well with the evaluation on the test data meaning that the model is able to generalize well onto the unseen events. Predicted values of flux exhibit high correlation with the observations (∼0.8) and low values of error.
The MERLIN model can have wide Space Weather applications. It can be used by the scientific community to analyze specific events as well as to reconstruct the long-term radiation belt dynamics at the 120-600 keV energy range, for which there is generally a lack of models. Furthermore, it is of use for satellite operators for the nowcast of the MEO environment and can provide information for the surface charging analysis. We note that the model was trained on the data from the GPS constellation, which follows an inclined orbit. Therefore, at higher magnetic latitudes, the satellites sample only a part of the equatorial pitch angle distribution. However, using the appropriate pitch angle models, it is possible to reconstruct the values of equatorial flux from MERLIN predictions (e.g., using a methodology of Allison et al., 2018). Further directions for the present study include, first, a more refined analysis of the feature importance using the appropriate permutation and Shapley value methods and the corresponding feature selection. Second, as GPS continues to probe the outer radiation belt, more data can be incorporated into the model.