A Machine Learning Approach to Classifying MESSENGER FIPS Proton Spectra

The κ distribution function is fitted to the entire data set of the MErcury Surface, Space ENvironment, GEochemistry, and Ranging (MESSENGER) mission's 1-min Fast Imaging Plasma Spectrometer (FIPS; Andrews et al., 2007, https://doi.org/10.1007/s11214-007-9272-5) proton spectra, and artificial neural networks (ANNs) are then used to assess the quality of this fit to the data. The κ distribution function is fitted to each proton spectrum using the downhill-simplex method, providing an estimate for density, n, temperature, T, and the κ parameter, which controls the shape of the distribution. The final trained neural network achieved a classification accuracy of 96% and has been used to classify the 1-min proton data set collected during MESSENGER's ∼4 years in orbit about Mercury. Of the 223,282 spectra, ∼160,000 were classified as having "good" fitting κ distributions; ∼133,000 of these were measurements obtained from within the magnetosphere, and ∼18,000 were from the magnetosheath.


Introduction
The plasma mass loading of the magnetic field plays an important role in determining the time scales of processes which occur within any planetary magnetosphere. The Alfvén velocity is the characteristic speed at which magnetohydrodynamic (MHD) waves travel within a magnetized plasma and is defined by v_A = B∕√(μ₀ρ), where B is the magnetic field magnitude, μ₀ is the permeability of free space, and ρ is the plasma mass density. The plasma mass density therefore controls the propagation speed of Alfvén and magnetosonic wave modes which transmit energy, momentum, and information throughout the magnetic environment of a planet. The loading of the plasma also has an effect on magnetic reconnection rates; work by Swisdak et al. (2010) suggests that reconnection is diamagnetically suppressed when Δβ > 2(L∕d_i) tan(θ∕2), where Δβ is the difference in plasma β between the magnetosheath and magnetospheric plasmas, θ is the shear angle between the magnetic fields which thread the two fluids, and L∕d_i, the width of the pressure gradient in units of the ion inertial length, is taken to be 1. For reconnection to occur during high Δβ, the magnetic fields must be close to antiparallel (e.g., Phan et al., 2010; Swisdak et al., 2010), but when Δβ is low, reconnection is more favorable and is able to occur with lower shear angles (e.g., Phan et al., 2010; Poh et al., 2016). This means that the mass content of a magnetized plasma is an important quantity which regulates magnetospheric time scales, response times to solar wind fluctuations, and the manner in which two plasmas may interact; it is therefore a crucial factor to consider when studying magnetospheric processes.
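The Alfvén speed relation above can be evaluated directly. The following is a minimal sketch in which the field strength and proton number density are illustrative values chosen here, not quoted from the text:

```python
import math

MU0 = 4.0e-7 * math.pi  # permeability of free space (H/m)

def alfven_speed(B, rho):
    """Alfven speed v_A = B / sqrt(mu0 * rho), SI units (T, kg/m^3) -> m/s."""
    return B / math.sqrt(MU0 * rho)

# Illustrative magnetospheric values (assumed, not from the text):
# B = 100 nT and n = 5 protons/cm^3, so rho = n * m_p
m_p = 1.6726e-27          # proton mass (kg)
rho = 5e6 * m_p           # 5 cm^-3 expressed in m^-3
v_a = alfven_speed(100e-9, rho)
print(f"v_A = {v_a / 1e3:.0f} km/s")
```

For these assumed values the Alfvén speed comes out near 10³ km/s, illustrating how strongly mass loading (through ρ) sets the MHD wave speed.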
Thus far, there have only been two missions which have sampled the Hermean plasma. Mariner 10 was able to sample the electrons in Mercury's magnetosphere (Ogilvie et al., 1977), but no data were collected on magnetospheric ions due to an instrument failure. The MErcury Surface, Space ENvironment, GEochemistry, and Ranging (MESSENGER; Solomon et al., 2007) mission, which orbited Mercury from 18 March 2011 to 30 April 2015, hosted the FIPS instrument, one of two spectrometers which made up the Energetic Particle and Plasma Spectrometer (EPPS) instrument. FIPS was a time-of-flight instrument capable of sampling ion populations at 64 energy-per-charge (E∕q) levels, ranging from 46 eV to 13.6 keV, every 60 s in normal mode and 10 s in burst mode (Andrews et al., 2007). While it was orbiting Mercury, FIPS obtained at least 1,250,000 individual proton spectra in the vicinity of the Hermean magnetosphere. Many of these spectra have previously been fitted with Maxwell-Boltzmann (M-B) distributions in order to derive plasma moments. The moments that are currently publicly available are all based on this Maxwellian fitting, which does not always accurately represent the observed spectra, thus providing an inaccurate measure of density and temperature in places. Those moments do have a quality flag associated with them; however, the flag is based solely on there being sufficient particle counts for a fitting to be attempted, and there is no assessment of the quality of the fit. The purpose of this study is to refit the entire data set of FIPS proton spectra with the κ distribution function, providing improved estimates of density, n, temperature, T, and the κ parameter. We take a nonlinear machine learning approach to assessing the quality of these fits to the data; artificial neural networks (ANNs) are used to determine whether or not the fit to the data is plausible, allowing us to retain the most usable FIPS spectra.

Data
The FIPS data products used here were downloaded from NASA's Planetary Data System (PDS) and are arranged into three data sets: (1) Experiment Data Records (EDR), (2) Calibrated Data Records (CDR), and (3) Derived Data Records (DDR). Figure 1 shows an example of some of the products of these data sets, where the proton counts, C, collected during the scan are shown in panel (a), the differential energy flux, dJ∕dE, in panel (b), and the phase space density (PSD), f, in panel (c). The abscissa of each panel represents the velocity calculated for each energy-per-charge bin, where the charge state is assumed to be 1 (i.e., v_i = √(2e(E∕q)_i∕m), where i is the E∕q bin index). The pink and gray dashed lines in each panel represent two different model fits to the data, which will be discussed in greater detail in section 3.
The counts measured at each of the 64 E∕q bins during the scan are obtained from the EDR "SCAN" data product. During FIPS' normal operation mode, one full scan is sampled every ∼60 s, but during burst mode, a full spectrum is provided every ∼10 s. For the purpose of this study, the burst mode spectra are grouped into typically six (sometimes five or seven) spectra and combined such that they are directly comparable to the 60-s spectra. The counts are also used to define the fractional error using Poisson statistics, σ_i∕C_i = 1∕√C_i, which is used to calculate the error bars in Figure 1c. The differential energy fluxes presented in Figure 1b are found in the CDR "SCAN" data product and are related to the counts by (dJ∕dE)_i = C_i∕(g ΔΩ ε_i τ_i N_s) (Raines, 2014), where τ_i is the accumulation time (5 ms for burst mode and 950 ms for normal mode), ΔΩ = 1.15 sr is the effective solid angle of the FIPS field of view, ε_i is the proton detection efficiency, N_s is the number of scans which make up the 60-s spectrum (assuming that the counts from each energy bin, C_i, are summed over all scans), and g = 8.31 × 10⁻⁵ mm² (eV/eV) is the energy-geometric factor (Raines, 2014). The PSD in Figure 1c is then calculated from the differential energy flux using f_i = (m²∕(2eE_i²)) (dJ∕dE)_i, where m is the mass of the ion species in kilograms, e is the elementary charge, and v_i is the particle velocity corresponding to the (E∕q) bin i (E_i = ½mv_i²).
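The Poisson fractional error and the bin-velocity conversion described above can be sketched in a few lines, assuming singly charged protons. The function names are ours, and the constants are standard SI values:

```python
import math

E_CHARGE = 1.602e-19   # elementary charge (C)
M_PROTON = 1.6726e-27  # proton mass (kg)

def fractional_error(counts):
    """Poisson fractional error: sigma_C / C = 1 / sqrt(C) for C > 0."""
    return [1.0 / math.sqrt(c) if c > 0 else float("nan") for c in counts]

def bin_velocity(e_per_q_eV, q=1):
    """Speed (m/s) for an E/q bin, assuming charge state q (q = 1 for protons)."""
    return math.sqrt(2.0 * q * E_CHARGE * e_per_q_eV / M_PROTON)

# FIPS E/q range is 46 eV to 13.6 keV (from the text)
v_low, v_high = bin_velocity(46.0), bin_velocity(13600.0)
print(f"proton velocity range: {v_low / 1e3:.0f}-{v_high / 1e3:.0f} km/s")
```

This gives a proton velocity range of roughly 90-1,600 km/s across the 64 bins, which matches the abscissa ranges seen in the spectral figures.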

Refitting Spectra
The gray dashed lines in each of the panels of Figure 1 represent the fitting of an M-B function to the data provided by the "NTP" product of the DDR data set. The NTP fits have a corresponding density, n m , and temperature, T m , which are used to describe the overall plasma distribution.
The fits provided in the NTP data set are often a good representation of the observed proton PSDs; however, there are a number of spectra which do not fit a simple M-B distribution but would fit a κ distribution much better. This is the case in the example shown in Figure 1, where the pink dashed lines in each panel represent the κ distribution fit to the spectrum, with a new corresponding density n_κ, temperature T_κ, and κ.
The κ distribution function, f_κ, may be written as (Pierrard & Lazar, 2010) f_κ(v) = n(πκθ²)^(−3∕2) [Γ(κ+1)∕Γ(κ−1∕2)] (1 + v²∕(κθ²))^(−(κ+1)), where Γ(x) is the Gamma function and θ is the thermal speed. The value of κ determines the shape of the distribution in velocity space. High-κ distributions tend toward the M-B distribution in the limit κ → ∞, which represents a plasma in a state of thermal equilibrium. Low-κ distributions exhibit a significant deviation from the M-B distribution with a high-energy tail and are considered to be "suprathermal." This deviation from thermal equilibrium increases with lower values of κ until the limit κ → 3∕2, at which point the plasma is in a state of "anti-equilibrium" (Livadiotis & McComas, 2013) where it approaches a power law distribution. This means that the κ distribution can be used to describe the FIPS spectra of both equilibrium (Maxwellian) and suprathermal plasmas. There are alternatives to using a single κ distribution to describe the spectra: A suprathermal plasma could be modeled using the sum of two Maxwellians or the sum of a Maxwellian with a κ distribution (e.g., Zouganelis et al., 2004); in either case, this would involve the fitting of more (four or five, respectively) free parameters instead of just the three fitted here.
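The κ distribution and its Maxwellian limit can be sketched numerically. The normalization below is one common convention and may differ from the exact form used in the paper; θ denotes the thermal speed parameter, and all values are illustrative:

```python
import numpy as np
from scipy.special import gamma

def kappa_dist(v, n, theta, kappa):
    """Isotropic kappa distribution (one common normalization; a sketch,
    not necessarily the paper's exact form):
    f(v) = n (pi k theta^2)^(-3/2) Gamma(k+1)/Gamma(k-1/2)
           * (1 + v^2/(k theta^2))^(-(k+1))"""
    norm = n * (np.pi * kappa * theta**2) ** -1.5 * gamma(kappa + 1) / gamma(kappa - 0.5)
    return norm * (1.0 + v**2 / (kappa * theta**2)) ** -(kappa + 1)

def maxwellian(v, n, theta):
    """Maxwell-Boltzmann limit of the kappa distribution (kappa -> infinity)."""
    return n * (np.pi * theta**2) ** -1.5 * np.exp(-((v / theta) ** 2))

v = np.linspace(0.0, 2e6, 20001)  # m/s
f_k = kappa_dist(v, n=1.0, theta=2e5, kappa=100.0)
f_m = maxwellian(v, n=1.0, theta=2e5)

# density is recovered by integrating over velocity space: int f 4 pi v^2 dv = n
dv = v[1] - v[0]
n_check = np.sum(f_k * 4.0 * np.pi * v**2) * dv
```

The numerical integral recovers the input density, and at κ = 100 the κ distribution is already very close to the Maxwellian, illustrating the κ → ∞ limit described above.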
The temperature of a suprathermal plasma cannot be described with the kinetic definition of temperature, using the average kinetic energy, ⟨E⟩ = ⟨½mv²⟩ (Maxwell, 1867). Instead, the thermodynamic definition of temperature is used (Livadiotis & McComas, 2013): T = (∂S∕∂U)⁻¹, where S and U are the entropy and internal energy. Livadiotis and McComas (2009) showed that this thermodynamic definition of temperature, with its dependence upon κ, was equivalent to the kinetic definition and could be used to describe the temperature of fluids which are not in thermal equilibrium. The temperature parameter, T, of the κ distribution function is related to the classical temperature, T₀, by T₀ = [κ∕(κ − 3∕2)]T, where the factor of 3∕2 arises from the assumption of a monatomic plasma with three degrees of freedom, and T₀ approaches the temperature of a Maxwellian distribution when κ → ∞. For ease of comparison between Maxwellian and κ distribution parameters, the temperature of the κ distribution will be referred to as T_κ, where T_κ ≡ T₀. This means that when κ is large, T_κ → T_m.
Although MHD simulations have had some success modeling Mercury's magnetosphere (e.g., Jia et al., 2015), it is considered that kinetic (non-MHD) effects are important due to its relatively small scale size (Baumjohann et al., 2006; Fujimoto et al., 2007) compared to the Larmor radii of the plasma ions (e.g., r_L ≈ 1,000 km for sodium ions with v = 400 km s⁻¹ and B = 100 nT). This becomes even more significant as the plasma distribution deviates away from thermal equilibrium and more particles exhibit high velocities. The Debye length of a suprathermal plasma, λ_κ, is also related to that of a Maxwellian plasma, λ_M, by λ_κ = λ_M √((2κ−3)∕(2κ−1)) (Chateau & Meyer-Vernet, 1991; Pierrard & Lazar, 2010), which shows that as κ → ∞, λ_κ → λ_M, but for lower-κ distributions, λ_κ < λ_M. With a reduced Debye length at low κ, the Debye sphere contains fewer ions, and the plasma is therefore less likely to behave like an ideal plasma.
Using the original density and temperatures provided by the NTP data product, (n_m, T_m), as a starting point, all of the FIPS spectra were refitted to provide new density and temperature estimates, n_κ and T_κ (here we use T₀, as it is directly comparable to T_m), along with a value for κ. This fitting was performed by numerically minimizing χ² = Σᵢ₌₁ⁿ (C_i − C_κ,i)²∕C_κ,i using the downhill simplex method (Nelder & Mead, 1965), where C_κ,i is the κ distribution function converted from PSD to counts for the ith E∕q bin and n is the total number of bins (n = 64). The value of C_κ,i is computed by inverting the differential energy flux and PSD relations above, giving C_κ,i = g ΔΩ ε_i τ_i N_s (2qE_i²∕m²) f_κ(v_i), where q = e and E_i = ½mv_i². Figure 2 shows a comparison between the original Maxwellian values of density (panel a) and temperature (panel b) with those obtained by fitting the κ distribution. Each panel shows a 2D histogram of occurrence where the abscissa represents the original values of n_m and T_m and the ordinate is the newly fitted n_κ and T_κ; the green dashed line in each plot represents a 1:1 ratio between the old and new fits. In Figure 2a, the densities predicted by both fits remain largely similar, with a very slight deviation toward higher n_κ predictions when n_m > 10 cm⁻³. The temperatures presented in Figure 2b also show that many of the new fits are close to the original values, though there is a little more spread about the 1:1 line than in panel (a). Some difference in the temperatures predicted by the two distribution fits is expected: Nicolaou and Livadiotis (2016) showed that when a Maxwellian function is fitted to plasma with a κ distribution, significant differences can arise when the plasma is not in thermal equilibrium. An examination of the κ parameters fitted to the spectra shows that κ > 10 for approximately 73% of measurements, which means that they would be adequately described using a Maxwellian; the other 27% have low κ and would not be well represented using a Maxwellian.
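The downhill-simplex fitting step can be illustrated with a toy version of the problem, here simplified to fitting the κ distribution directly in phase space density (rather than instrument counts) with synthetic, noise-free data. All parameter values are illustrative, and the log-space residual is our own choice to tame the large dynamic range:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gamma

def kappa_psd(v, n, theta, kappa):
    # kappa distribution in phase space density (same sketch normalization as above)
    norm = n * (np.pi * kappa * theta**2) ** -1.5 * gamma(kappa + 1) / gamma(kappa - 0.5)
    return norm * (1.0 + v**2 / (kappa * theta**2)) ** -(kappa + 1)

v = np.geomspace(1e5, 1.6e6, 64)          # 64 velocity bins (illustrative)
true = dict(n=5.0, theta=3e5, kappa=4.0)  # synthetic "observed" spectrum
f_obs = kappa_psd(v, **true)

def chi2(p):
    # fit log-parameters so n, theta, and (kappa - 3/2) stay positive
    n, theta, dk = np.exp(p)
    f_mod = kappa_psd(v, n, theta, dk + 1.5)
    return np.sum((np.log(f_mod) - np.log(f_obs)) ** 2)

p0 = np.log([2.0, 2e5, 2.0])  # deliberately poor starting guess
res = minimize(chi2, p0, method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-10, "fatol": 1e-12})
n_fit, theta_fit = np.exp(res.x[0]), np.exp(res.x[1])
kappa_fit = np.exp(res.x[2]) + 1.5
```

The simplex method only ever improves its best point, so the final misfit is no worse than the starting one, and on this smooth, noise-free problem the three parameters are recovered accurately.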

Neural Network Configuration
Ultimately, the aim of this work is to discard any FIPS spectra which cannot be used to provide a reliable estimate of the density and temperature of the plasma in the Hermean environment. This is achieved by assessing the goodness of fit between a plasma distribution model and the data, which is a straightforward task to perform manually by visual inspection when there are only a small number of spectra; with large numbers (here 223,282 spectra), however, this would be a very time-consuming process with inconsistent results. Traditional measures of goodness of fit (e.g., defining a threshold value for χ²; see section 6 for some examples) do not work very well with these data, as parameters such as phase space density extend over many orders of magnitude. Instead, ANNs may be trained to classify whether or not data are reliable using a small subset of these spectra. The model produced by the trained neural network can then be applied to the entire data set to provide a consistent classification of each spectrum. In this paper, we make use of modular neural networks, where multiple networks are trained on small sections of the spectra, the results of which are then used as inputs to a final deep neural network. This section describes in detail the architecture of the ANNs, feature selection and transformation, and neural network training.

Network Architectures
The method described here uses a total of nine "feed-forward" ANNs to determine whether or not the κ distribution fit to the FIPS data is good. A feed-forward neural network consists of three types of layers: input, hidden, and output. These layers are composed of "nodes," which are somewhat analogous to neurons; each node takes the values of multiple inputs, combines them, and uses some activation function to provide an output value. The process of propagating data through a feed-forward network begins by inserting the data directly into the input layer. The output of any given layer then becomes the input to the next layer, where there is a connection between every pair of nodes in two adjacent layers (i.e., each node has an input which corresponds to the output of every node in the previous layer). The data are propagated through the network layer by layer, until the output layer is reached (this process will be discussed in further detail in section 4.4 and Appendix B). The layers which exist between the input and output layers are the hidden layers; they are only connected to the adjacent layers within the network and do not provide any visible input or output to the network. Figure 3 shows a schematic of the overall modular network, where data are input into eight separate networks initially, the outputs of which are subsequently fed into a final network alongside other input data. The first layer of ANNs will be referred to as the "split" layer of the modular neural network, and the ANN which combines these to provide the output of the modular neural network will be referred to as the "final" layer.
In the split layer, each network assesses the quality of the model fit to the data for one eighth of the spectrum, each section comprising eight E∕q bins. Figure 4a depicts the architecture used for the split ANNs, where each of these networks is identical and consists of four layers, L, represented by shaded rectangles. The first layer is the input layer (on the left), the second and third are hidden, and the final layer is the output layer (on the right). The number of nodes in a particular layer is denoted by s_l, where the subscript l corresponds to the layer index. The input layer, s₀, is defined by the number of input "features" (discussed further in section 4.2) such that s₀ = 31. The number of output nodes is defined by how many class labels there are for the data; each section of the spectrum must be classified as either "good" or "bad," so s₃ = 2. Huang (2003) found that the learning capacity of a feed-forward neural network with two hidden layers can be achieved using s₁ = √((K+2)N) + 2√(N∕(K+2)) and s₂ = K√(N∕(K+2)), where K is the number of output classes (K = s₃ = 2) and N is the number of samples. We choose N such that it is sufficient to model the cross-validation set, rather than the training set (both will be discussed further in section 4.3), which implicitly adds some level of regularization to the networks by reducing the total number of free parameters. Using these expressions, we find that s₁ = 39 and s₂ = 13 for the split layer networks.
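The two-hidden-layer sizing expressions of Huang (2003) can be checked in a few lines. With K = 2 output classes and N = 169 (close to the cross-validation set size), they reproduce the quoted split-network sizes s₁ = 39 and s₂ = 13; the rounding to whole nodes is our assumption:

```python
import math

def huang_hidden_sizes(K, N):
    """Hidden layer sizes for a two-hidden-layer feed-forward network,
    after Huang (2003):
      s1 = sqrt((K + 2) N) + 2 sqrt(N / (K + 2))
      s2 = K sqrt(N / (K + 2))
    Rounded to whole node counts (rounding scheme assumed here)."""
    s1 = math.sqrt((K + 2) * N) + 2.0 * math.sqrt(N / (K + 2))
    s2 = K * math.sqrt(N / (K + 2))
    return round(s1), round(s2)

print(huang_hidden_sizes(2, 169))
```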
The architecture of the final layer is portrayed in Figure 4b; this ANN has more inputs than those of the split layer, with 239 input features. The sizes of the hidden layers were determined using the same expressions of Huang (2003); layers s₁ and s₂ contain 68 and 24 nodes, respectively. The final layer has just two output nodes, as with the split layer networks.

Input Features
The input data for all nine of the networks in both the split and final layers of the modular neural network are derived from the EDR and CDR data sets, from the refitted κ distributions of section 3, or from a combination of both. Scalar parameters such as density, temperature, and κ each provide a single "feature" of the input data, where a single feature corresponds to a single input node. Multielement parameters such as C or Δf (the difference between measured and modeled PSD) provide multiple features and therefore correspond to multiple input nodes to the neural networks. The full list of parameters used in the networks is given in Table A1, along with a brief description and the number of input nodes that each requires for the split and final ANNs.
While some of the parameters in Table A1 are either binary or restricted to values between 0 and 1, many of the parameters have a large or unlimited range of possible values. In order for the neural network to successfully utilize the input features and find an optimum of the cost function (see section 4.4) within parameter space, the input parameters should all have similar ranges in their value (e.g., Sola & Sevilla, 1997). If there is no scaling performed, then a step of a given length in parameter space, along the axis of a parameter which varies very slowly, will take a long time to find a local optimum, whereas that same step along the axis of a parameter which varies over a very short range may jump across the optimum, skipping the optimal solution entirely.
The input parameters which could have values outside of the range 0 to 1 were transformed using the Box-Cox transform (Box & Cox, 1964), such that their distributions were close to normal. The resultant normal distributions were then rescaled such that they had a mean of 0 and a standard deviation of 1. For a more detailed description of the Box-Cox transform and the rescaling process, see Appendix A.
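A minimal sketch of this transformation, applied to a synthetic skewed feature (SciPy selects the Box-Cox λ by maximum likelihood; the example data are ours, not from the FIPS set):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=2.0, sigma=1.0, size=10000)  # skewed, strictly positive "raw" feature

# Box-Cox requires strictly positive input; lambda is chosen by maximum likelihood
x_bc, lam = stats.boxcox(x)

# rescale to zero mean and unit standard deviation
x_scaled = (x_bc - x_bc.mean()) / x_bc.std()
```

For lognormal input the fitted λ is close to 0 (i.e., the transform reduces to a logarithm), and the rescaled feature has the zero-mean, unit-variance form the networks expect.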

Labeled Spectral Samples
For training all of the neural networks, it was necessary to manually collect and label (i.e., "good" or "bad") a relatively small number of samples. For the split networks, a total of 1,000 spectra were split into their eight sections, each of which was manually assigned a class label of "1" (bad) or "2" (good), corresponding to the output node we wish to activate. Of these 1,000 samples, ∼2∕3 (667) were used to form the "training set" for training the neural networks; ∼1∕6 (167) formed the "cross-validation set," which is used to determine and minimize the degree of overfitting in the networks; and ∼1∕6 (166) samples were the "test set," which is used to provide the final measurement of accuracy for each network.
For each split section, there was an imbalance between the number of good and bad samples. A highly skewed sample set, where many more samples of one class exist than of the other, could cause the neural network to be trained incorrectly: If 90% of samples are "good," then a local optimum exists where the network classifies everything as "good," automatically giving it an accuracy of 90%. In order to avoid this, samples with the least common class label in the training sets were replicated in order to balance the classes (i.e., all of the samples in the smallest class were duplicated repeatedly until the number of samples matched the other class). Table 1 shows the number of samples of each class which form the training set for each split ANN and also the effective number of samples after the replication of spectra with the least common class label.
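The replication scheme can be sketched as follows; `balance_by_replication` is a hypothetical helper written for illustration, not a function from the paper:

```python
import numpy as np

def balance_by_replication(X, y):
    """Duplicate samples of the minority class until the two classes match
    in size (a sketch of the replication scheme described in the text)."""
    idx_a, idx_b = np.where(y == 0)[0], np.where(y == 1)[0]
    minority, majority = sorted((idx_a, idx_b), key=len)
    if len(minority) == 0 or len(minority) == len(majority):
        return X, y
    # tile the minority indices until they match the majority count
    reps = np.resize(minority, len(majority))
    keep = np.concatenate([majority, reps])
    return X[keep], y[keep]

X = np.arange(10, dtype=float).reshape(10, 1)
y = np.array([0] * 9 + [1])          # 9 "bad" vs 1 "good": highly skewed
Xb, yb = balance_by_replication(X, y)
```

After balancing, each class contributes the same number of (possibly duplicated) samples, removing the trivial "classify everything as the majority" optimum.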
For the final network, NN 8 in Figure 3, a total of 3,000 labeled samples were collected, where there were exactly 1,500 of each class. The samples were split into the three sets with the same proportions as with the split layer; the training, cross-validation, and test sets contained 2,000, 500, and 500 samples, respectively, where each set had an even balance of both "good" and "bad" spectra.
For each of the aforementioned sample sets, the rescaled features of each spectrum (as described in section 4.2) are placed into a feature matrix, X ∈ ℝ^(m×n), where m is the number of samples and n is the number of features (i.e., the number of input nodes). The class labels are also placed into a "one-hot" class matrix, y ∈ {0, 1}^(m×K), where K corresponds to the total number of class labels (i.e., the number of output nodes). In this matrix, all elements are set to 0, except for the elements corresponding to the appropriate class label for each sample, which are set to 1. The aim is to train the neural network such that, given the training set feature matrix, X, it reproduces the target class matrix, y, from the output layer.
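Constructing the one-hot class matrix y from integer labels might look like this (a sketch; the integer label encoding here is illustrative):

```python
import numpy as np

def one_hot(labels, K):
    """Build the (m x K) one-hot class matrix y from integer class labels:
    every element is 0 except the column of each sample's class, which is 1."""
    m = len(labels)
    y = np.zeros((m, K), dtype=int)
    y[np.arange(m), labels] = 1
    return y

labels = np.array([0, 1, 1, 0])   # e.g., 0 = "bad", 1 = "good"
y = one_hot(labels, K=2)
```

Each row then contains exactly one 1, in the column of its class, which is the target form the output layer is trained to reproduce.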

Network Learning
In order for the neural network to correctly classify spectra, appropriate values for the weights and biases must be learned which correctly map the input features to the desired output class. The first step in the neural network learning process is to propagate the data from the input matrix, X, through the network. As previously mentioned in section 4.1, the input data are propagated through the network via a series of matrix operations and activation functions between each layer until a hypothesis matrix, h, is produced at the output layer. The hypothesis matrix contains the activations of the output layer for each input sample and output node; the class label assigned to each sample is defined using the node which provides the greatest activation value. A more detailed description of neural network propagation is given in Appendix B.
To assess how well the output of the neural network represents the class labels of the training data, we make use of the cross-entropy cost function, J = −(1∕m) Σᵢ₌₁ᵐ Σₖ₌₁ᴷ [yᵢₖ ln hᵢₖ + (1 − yᵢₖ) ln(1 − hᵢₖ)] + (λ∕2m) Σ w², where the first term provides a measure of the difference between the predefined class labels, y, and the output hypothesis, h, summed over m samples and K output nodes. The second term is the L2 regularization term, which uses the sum of the squares of the weight matrices, w, to regularize the network. L2 regularization acts to keep the weights of the network small, reducing the chance of any nodes becoming saturated, effectively smoothing the neural network response (Hagan et al., 1996, chapter 13) and thus reducing overfitting.
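The cost function described above can be written in a few lines of NumPy; the clipping constant is our own guard against log(0) and not part of the paper's formulation:

```python
import numpy as np

def cross_entropy_cost(h, y, weights, lam):
    """Cross-entropy cost with L2 regularization (a sketch of the form in the text):
      J = -(1/m) sum_ik [ y ln h + (1 - y) ln(1 - h) ] + (lam / 2m) sum w^2"""
    m = y.shape[0]
    eps = 1e-12                      # guard against log(0); our own addition
    h = np.clip(h, eps, 1.0 - eps)
    data_term = -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h)) / m
    reg_term = lam / (2.0 * m) * sum(np.sum(w**2) for w in weights)
    return data_term + reg_term

y = np.array([[1.0, 0.0], [0.0, 1.0]])
w = [np.ones((3, 2))]                          # toy weight matrix
perfect = cross_entropy_cost(y, y, w, lam=0.0)       # ~0 when h matches y exactly
regularized = cross_entropy_cost(y, y, w, lam=1.0)   # only the L2 penalty remains
```

With a perfect hypothesis the data term vanishes, so any remaining cost is purely the L2 penalty on the weights, which is what drives the weights toward small values during training.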
Learning is achieved by modifying the weight and bias matrices (w and b, respectively) such that the cost function is minimized, corresponding to reducing the difference between the output hypothesis and the predefined class labels. Modifying the weights and biases is often done using gradient descent, where the gradients of the cost function with respect to each individual weight and bias are used to determine the direction in which to make a small step within "weight space" toward a minimum in J. The gradients are calculated using the back-propagation algorithm (Rumelhart et al., 1985; see Appendix C for a complete explanation). In order to train the neural networks used in this study, two gradient-descent-based algorithms with adaptive learning rates were used: For the split networks (NN 0-7 of Figure 3), we used the iRProp+ algorithm (Igel & Hüsken, 2003) for batch learning; for the final neural network (NN 8 of Figure 3), we used the RMSProp algorithm with mini-batch learning (both algorithms are discussed in more detail in Appendix C). Full-batch learning uses all of the input samples to decide every learning step, so there is one step for every pass through the input data (every "epoch"). With the final neural network, NN 8, mini-batch learning was used, where the input data are split into smaller batches and a small step is made for each batch; that is, multiple steps are taken for each epoch (10 steps for mini-batches of 200 samples). The advantage of the mini-batch approach is that the learning algorithm is more likely to be able to jump out of a false optimum in J.

Training the FIPS Modular Network
This section describes the process used to optimize all nine ANNs which make up the FIPS modular neural network. The six steps taken to classify the entire data set of FIPS proton spectra are described in more detail in the subsequent subsections.

Pretraining
The method of pretraining used in Steps 1 and 4 involves training each ANN with the training set, with no regularization. In this case, the networks soon become overfitted to the training data, with typical cross-validation accuracies in the range of 85-95%. These networks are then used to provide a very approximate or "loose" class label to the entire data set of proton spectra. The overfitting of the neural networks at this point is not an issue because the majority of spectra would still be correctly labeled, and the fine tuning of the networks can happen after the pretraining stage.
Using the ∼223,000 loosely labeled spectra as a temporary training set, the networks with randomly initialized weights and biases can quickly be trained in 10 epochs, using the RMSProp algorithm with a mini-batch size of 10,000 samples. The training accuracy achieved with the loosely labeled spectra ranged from 94% to 98% for ANNs 0-7 and was 95.3% for ANN 8. The weights and biases for each of these networks were then saved for use as a good starting point for training the networks using the training sets defined in section 4.3. This process was first done for networks 0-7 and then for ANN 8 after Step 3 was completed.

Regularization and Classification
Once pretraining of a network is complete, the weights and biases found in the previous section are used to initialize the neural networks in order to find the optimal regularization value, λ. Eighty-one values of λ, logarithmically spaced from 10⁻⁵ to 10³, were tested for all nine neural networks. In each test of ANNs 0-7, the networks were batch trained for 200 epochs, which is enough for overfitting to occur in unregularized networks. ANN 8 was also trained for 200 epochs but using the RMSProp algorithm with mini-batches of 250 samples. When λ is set too low, the neural networks are more susceptible to overfitting to the training data and perform badly on the cross-validation set because they have not generalized well. In the cases where λ is set too high, the neural network is unable to converge on a solution which accurately describes the training set, because regularization smooths the response of the network too much, forcing it to underfit; it therefore performs badly on both training and cross-validation sets. A good value for the L2 regularization parameter is straightforward to find; an optimal λ provides a balance between underfitting and overfitting and can be characterized by a peak in cross-validation accuracy. Table 2 shows the values of λ found to provide the best representation of the cross-validation set for each network; the associated training, cross-validation, and test set accuracies are also provided.
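The λ grid described above is straightforward to reproduce; the accuracy curve below is synthetic and purely illustrative of selecting the peak in cross-validation accuracy:

```python
import numpy as np

# 81 values of lambda, logarithmically spaced over 10^-5 .. 10^3 (from the text)
lambdas = np.logspace(-5, 3, 81)

def pick_lambda(cv_accuracy):
    """Select the regularization value giving peak cross-validation accuracy.
    cv_accuracy: one accuracy per value in `lambdas`."""
    return lambdas[int(np.argmax(cv_accuracy))]

# toy accuracy curve: poor at very low lambda (overfit) and very high lambda
# (underfit), peaking in between -- illustrative only
acc = -((np.log10(lambdas) + 1.0) ** 2)
best = pick_lambda(acc)
```

For this toy curve the peak sits at λ = 0.1; with real cross-validation accuracies, the same `argmax` selection picks the per-network λ values reported in Table 2.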
The weights learned in these neural networks were then used to classify whether the fit to each section of each spectrum (Step 3), and subsequently the whole of each spectrum (Step 6), was "good" or "bad" (see the supporting information for the final trained weight and bias matrices). Of the 223,282 spectra, 162,571 (∼73%) were classified as having good fits with the κ distribution. Table 3 shows the numbers of good and bad spectra found at different locations around and including the magnetosphere. The regions of space around the magnetosphere were defined using the MESSENGER MAG data to determine when the spacecraft had crossed a boundary (magnetopause or bow shock), with the method described by Winslow et al. (2013). The closed field line region in Table 3 was found by performing a magnetic field trace from the known location of MESSENGER at the time of each spectrum, using the KT17 (Korth et al., 2017) magnetic field model; any traces which had two footprints on the planet were considered to be closed.
There is one important caveat which should be restated: Because of field of view limitations due to the FIPS placement on the spacecraft, FIPS does not observe full plasma distributions in most cases, and the relationship of these moments to the true plasma moments depends on some assumptions (Raines et al., 2011). The expected Mach number (M), the ratio of the bulk to the thermal speed of the plasma, provides some guidance on this matter. In the solar wind (M > 3-4), Gershman et al. (2012) showed that enough of the distribution is measured for bulk speed and temperature to be within 10% of true values, as long as some effort is made to exclude low-count or non-solar wind periods (e.g., foreshock). This method should effectively make those exclusions. In sufficiently subsonic plasmas (M < 0.5), where the distribution has a much higher thermal speed than bulk speed, moments recovered from FIPS observations can be accurate provided that the distribution is reasonably isotropic. These two conditions will typically hold in the central plasma sheet, especially when 1-min averages are used to smooth out variations. The same is true in most dayside magnetosheath crossings, especially within 30° of the Sun-Mercury line. In contrast, these assumptions can easily be violated in nightside magnetosheath crossings, because the plasma will likely have returned to supersonic flow, though not sufficiently so for supersonic moment recovery. The fits for solar wind and nightside magnetosheath spectra should therefore be treated with caution. Figure 5 shows how the "good" spectra in some of the populations in Table 3 compare in terms of their relative densities and temperatures. The magnetospheric population (MS, in blue) is by far the largest, with T_κ typically in the range of 1-100 MK and n_κ around 0.3-30 cm⁻³. The second largest population is the magnetosheath (SH) in red; this population is lower in temperature but higher in density than the MS population.
The spectra obtained at the magnetopause (MP) in yellow appear mostly where the MS and SH populations overlap; the MP crossings described in Winslow et al. (2013) are actually defined by the inner and outermost crossings during each pass of MESSENGER into the magnetosphere from the magnetosheath and vice versa, where MESSENGER often experienced multiple crossings through the boundary-so these spectra would most likely be of a mixture of MS or SH plasma. The bow shock (BS) and solar wind (SW) spectra are also shown in green and black, respectively, but due to the high bulk velocity, these values of T and n should be treated with caution. Figure 6 shows nine examples of spectral fits classified using the FIPS modular neural network. In all nine panels, the PSD is plotted against the velocity, with the fitted distribution shown in pink as in Figure 1c; also shown in each of the panels are seven vertical dotted lines which separate the eight sections of the spectrum which would be classified using ANNs 0-7. The probability that each section is a "good" fit is shown at the bottom of each section, where the color is indicative of the class label: green is "good," and red is "bad." The overall class label output by ANN 8 is also indicated using color (with the associated probabilities shown top center)-the backgrounds of Figures 6a to 6f are shaded green to denote good overall fits; the backgrounds of Figures 6g to 6i are red, indicating that these fits were deemed to be "bad" by the final neural network. The title of each panel displays the date and time of the spectrum, with the value of 2 for comparison with the ANN classification. Figure 6. Using a similar format to that of Figure 1c, panels (a)-(f) show the phase space density of six proton spectra, each of which was classified as having a "good" fitting distribution, and panels (g)-(i) show spectra classified by the neural networks as having a "bad" fit. 
In all panels, the vertical dotted lines represent the eight sections of the spectra, each of which corresponds to a network in the split layer. The new κ-distribution fits are shown as pink dashed lines, and the gray dashed lines are the original Maxwellian fits from the NTP data set for comparison. The probabilities of each section being "good" are shown in boxes at the bottom of the panels, and the probability output from the final network is shown at the top of each panel. The green and red shading in all plots is indicative of the class label output from the network: "good" and "bad," respectively. The title of each panel shows the date and time of the spectrum, alongside the value of χ².
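The modular arrangement described above, eight section classifiers (ANNs 0-7) feeding a final network (ANN 8), can be sketched as follows. This illustrates the data flow only: the weights are random rather than trained, and the layer sizes and per-section input dimensions are assumptions for the sake of a runnable example.

```python
import numpy as np

rng = np.random.default_rng(42)

def mlp(x, w1, b1, w2, b2):
    """Tiny feed-forward network: one hidden tanh layer, sigmoid output."""
    h = np.tanh(x @ w1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))

def random_net(n_in, n_hid=4):
    """Untrained weights, standing in for a trained network."""
    return (rng.normal(size=(n_in, n_hid)), np.zeros(n_hid),
            rng.normal(size=(n_hid, 1)), np.zeros(1))

N_SECTIONS, SAMPLES_PER_SECTION = 8, 8          # assumed input sizes
section_nets = [random_net(SAMPLES_PER_SECTION) for _ in range(N_SECTIONS)]
final_net = random_net(N_SECTIONS)              # stands in for ANN 8

def classify_spectrum(spectrum):
    """Split the spectrum into 8 sections, score each with its own
    network (ANNs 0-7), then combine the 8 section probabilities with
    the final network (ANN 8) into one overall 'good' probability."""
    sections = spectrum.reshape(N_SECTIONS, SAMPLES_PER_SECTION)
    p_sections = np.array([float(mlp(s, *net)[0])
                           for s, net in zip(sections, section_nets)])
    p_overall = float(mlp(p_sections, *final_net)[0])
    return p_sections, p_overall
```

The split layer lets each sub-network specialize on one part of the spectrum, while the final network learns how section-level quality combines into an overall label.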

Examples
Panels (a)-(c) of Figure 6 each show a good fit to distributions with high κ values, which are a close match to the original Maxwellian fits in gray. The spectra in panels (a) and (b) both have four sections which were classified as being good, but both also exhibit large χ² values, whereas the spectrum in panel (c) has only two "good" sections and much lower χ² values. Panels (d)-(f) all show "good" overall fits to non-Maxwellian suprathermal plasma distributions, two of which (d and e) have very large χ² > 100; these fitted distribution functions provide very different plasma moments to those of the original NTP data set. The densities predicted in panels (d) and (e) differ by ∼15% when compared to their Maxwellian counterparts; the most notable difference is in the temperature, which appears to be overestimated when using the M-B function to represent a non-Maxwellian spectrum. Panels (g)-(i) show three spectra which ANN 8 considered to be "bad" fits; the spectra in panels (g) and (h) are both characterized by lower values of χ² than all six "good" spectra, which is unexpected because a good fit should be characterized by a small χ². Panel (i) shows a bad fit to a bimodal distribution, where the new κ distribution appears to fit the higher energy of the two peaks and the original Maxwellian does not fit either peak. Bimodal distributions in FIPS proton spectra like that in Figure 6i are an unusual case, the study of which is beyond the scope of this paper, but they may be of interest for further study. There is no single threshold value of χ² which could successfully divide the "good" and "bad" fits to the spectra, as demonstrated in Figure 6.

Figure 7. Panel (d) shows the probability output from ANN 8 for each spectrum in red, where the green and red shading in the background denotes the class label as in Figure 6.
Panel (e) shows the magnetic field in Mercury-centered solar orbital (MSO) coordinates during this pass, where red, green, and blue correspond to the x, y, and z components, and black is the total magnetic field magnitude, ±|B|. In all panels, the black dashed vertical line shows the time when MESSENGER passed through the magnetic equatorial plane, yellow vertical shaded areas correspond to the magnetopause crossings, and the red shaded areas represent the pass through the bow shock.

Near the start of Figure 7, at ∼5:30, MESSENGER crossed the magnetopause into the southern tail lobe of the magnetosphere. At around 6:40, MESSENGER crossed the magnetic equatorial plane in Mercury's magnetotail as it traveled northward. Subsequently, MESSENGER crossed the northern cusp at around 7:15, evidenced by an enhancement in proton density and flux. Then at around 7:30, MESSENGER crossed the dayside magnetopause into the magnetosheath, where densities were much higher and temperatures much lower, and finally passed through the bow shock at around 7:45. The densities and temperatures predicted using our new method are not too dissimilar to the original NTP values in most cases, but in some instances in this example, densities are up to ∼29% higher and temperatures ∼50% lower. Generally, the new plasma moments represent a better fit to non-Maxwellian plasma distributions, and the neural networks have removed the majority of the unreliable and unusable data, as illustrated in Figure 7.
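Returning to the χ² comparison in the examples above: a reduced χ² is straightforward to compute for any fit, yet no single threshold on it separates the "good" and "bad" classes, which is what motivates the network-based classification. A minimal sketch is below; the uncertainty model and the three-parameter count (for n, T, and κ) are assumptions for illustration.

```python
import numpy as np

def reduced_chi_square(psd_obs, psd_model, sigma, n_params=3):
    """Reduced chi-square of a fitted model against an observed
    spectrum; n_params = 3 for the (n, T, kappa) fit parameters."""
    resid = (psd_obs - psd_model) / sigma
    dof = psd_obs.size - n_params          # degrees of freedom
    return float(np.sum(resid ** 2) / dof)
```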

Conclusion
In this paper, the FIPS proton spectra have all been refitted with a κ distribution function, which provides a better fit to the non-Maxwellian plasma distributions present in the Hermean magnetosphere. This means that more accurate estimates of the plasma moments can be used to describe the Hermean plasma. Refitting of the data is coupled with a machine-learning-based method of identifying badly fitted spectra, whether because the data have too few counts to provide a reasonable representation of the plasma distribution or because a κ distribution could not describe the plasma (for example, if there is a bimodal velocity-space distribution).
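As a concrete illustration of why the κ distribution generalizes the Maxwellian, one common isotropic form is sketched below. This is a standard textbook form, not necessarily the exact normalization or thermal-speed convention used in the paper; as κ → ∞ it reduces to the Maxwell-Boltzmann distribution.

```python
import numpy as np
from scipy.special import gammaln

def kappa_dist(v, n, theta, kappa):
    """One common isotropic kappa distribution with thermal speed
    theta; gammaln avoids Gamma-function overflow at large kappa."""
    log_norm = (np.log(n) - 1.5 * np.log(np.pi) - 3.0 * np.log(theta)
                - 1.5 * np.log(kappa)
                + gammaln(kappa + 1.0) - gammaln(kappa - 0.5))
    return np.exp(log_norm) * (1.0 + v ** 2 / (kappa * theta ** 2)) ** -(kappa + 1.0)

def maxwellian(v, n, theta):
    """Maxwell-Boltzmann distribution with the same thermal speed."""
    return n / (np.pi ** 1.5 * theta ** 3) * np.exp(-(v / theta) ** 2)
```

For small κ the suprathermal tail is much harder than the Maxwellian's exponential falloff, which is why forcing an M-B fit onto such spectra biases the recovered temperature.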
Once the well fitted plasma moments are separated from the badly fitted ones, they can be used to provide accurate estimates of the typical plasma properties observed in different regions of the Hermean magnetosphere. Plasma moments are vital for a whole range of magnetospheric topics; future work using these moments could include characterizing the variations in plasma mass density as a function of distance along magnetic field lines, or studying the variations in the plasma properties at Mercury during extreme solar events such as interplanetary coronal mass ejections (ICMEs).
The methods applied above are also not limited to the FIPS proton spectra. FIPS measurements of heavier ion populations at Mercury, such as Na+, would be candidates for this method, as would other plasma data sets, such as the measurements due to be made by the BepiColombo mission when it reaches Mercury in 2025.

Appendix A: Input Parameter Transform
The parameters, x, that were not restricted within the range 0 to 1 were scaled using the Box-Cox transformation (Box & Cox, 1964),

x^{(\lambda)} = \begin{cases} \left(x_s^{\lambda} - 1\right)/\lambda, & \lambda \neq 0, \\ \ln x_s, & \lambda = 0, \end{cases}

where λ is the power parameter, x_s is the x parameter shifted by some constant such that all values are greater than zero, and x^{(λ)} is the transformed parameter. The value of λ was obtained by maximizing the log-likelihood function,

\ell(\lambda) = -\frac{n}{2}\ln\left[\frac{1}{n}\sum_{i=1}^{n}\left(x_i^{(\lambda)} - \bar{x}^{(\lambda)}\right)^{2}\right] + (\lambda - 1)\sum_{i=1}^{n}\ln x_{s,i},

where \bar{x}^{(\lambda)} = \frac{1}{n}\sum_{i=1}^{n} x_i^{(\lambda)} and n is the number of elements in x. This transforms the distribution of each parameter such that it is close to normally distributed. Then each parameter is normalized to a similar scale using Box-car smoothed counts with a window length of 7 (the black line in the top panel of Figure 1).
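Assuming the standard maximum-likelihood Box-Cox fit matches the procedure above, the transform can be reproduced with SciPy, which selects λ by maximizing the same log-likelihood. The shift constant and the final standardization step here are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy import stats

def boxcox_normalize(x):
    """Shift x to be strictly positive, Box-Cox transform it with the
    maximum-likelihood lambda, then standardize to zero mean and unit
    variance so all parameters share a similar scale."""
    x = np.asarray(x, dtype=float)
    shift = 1.0 - x.min() if x.min() <= 0 else 0.0   # illustrative shift
    x_s = x + shift
    x_t, lam = stats.boxcox(x_s)    # lambda maximizes the log-likelihood
    x_norm = (x_t - x_t.mean()) / x_t.std()
    return x_norm, lam, shift
```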
T    | 1 | 1 | Temperature of the κ-distribution fit.
y_p  | 0 | 8 | Probability that each split section is "good," obtained from the split layer of the modular neural networks.
y_c  | 0 | 8 | "Good" or "bad" class label for each split section, obtained from the split layer of the modular neural networks.

Figure A1. Panel (a) shows the distribution of the fitted densities, n, for all of the FIPS spectra. Panel (b) shows the distribution of the densities following the Box-Cox transform.

Figure A1 shows an example of this transformation for the density parameter of all 223,282 FIPS spectra, where panel (a) shows the distribution of the original data and panel (b) shows the distribution of the transformed data. The parameters used in this transformation are shown in the upper-right corner of panel (b). The transformed data follow an almost Gaussian distribution with its mean centered on 0. This step in the process of analyzing the κ-distribution fits to the FIPS data is an intermediate one and is fully reversible (i.e., the transformed input data can be mapped back to their original values).
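The reversibility noted above can be checked directly: SciPy provides the inverse transform, so a forward Box-Cox followed by `inv_boxcox` recovers the original values (any shift constant applied beforehand must also be undone; none is needed for the strictly positive sample used here).

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(1)
x = rng.lognormal(size=1000)     # strictly positive sample, no shift needed

x_t, lam = stats.boxcox(x)       # forward transform (ML lambda)
x_back = inv_boxcox(x_t, lam)    # inverse transform

assert np.allclose(x_back, x)    # round trip recovers the original data
```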