Benchmarking Forecasting Models for Space Weather Drivers

Space weather indices are commonly used to drive operational forecasts of various geospace systems, including the thermosphere for mass density and satellite drag. The drivers serve as proxies for various processes that cause energy flow and deposition in the geospace system. Forecasts of neutral mass density is a major uncertainty in operational orbit prediction and collision avoidance for objects in low earth orbit (LEO). For the strongly driven system, accuracy of space weather driver forecasts is crucial for operations. The High Accuracy Satellite Drag Model (HASDM) currently employed by the United States Air Force in an operational environment is driven by four (4) solar and two (2) geomagnetic proxies. Space Environment Technologies (SET) is contracted by the space command to provide forecasts for the drivers. This work performs a comprehensive assessment for the performance of the driver forecast models. The goal is to provide a benchmark for future improvements of the forecast models. Using an archived data set spanning six (6) years and 15,000 forecasts across solar cycle 24, we quantify the temporal statistics of the model performance.


Introduction
Accurately quantifying mass density in the thermosphere remains a predicament for the community. The difficulty stems from the highly dynamic nature of the thermosphere, an environment driven by a number of factors ranging from solar extreme ultraviolet (EUV) and geomagnetic heating to gravity waves in the lower atmosphere. Emmert (2015) provides a thorough overview of the physical drivers and their effects on thermospheric density. Current capabilities limit our ability to predict satellites' trajectories with precision in an operational setting. During large solar and geomagnetic storms, operators struggle to locate many resident space objects, let alone have the means to predict their orbits (Berger et al. 2020). Many resources in the United States and abroad are devoted to tracking satellites and determining their orbits in order to protect humans and other assets in space.
HASDM (Storz et al. 2005) is an assimilative empirical model that uses a large batch of calibration satellites to make corrections to a density nowcast from the Jacchia-Bowman 2008 (JB2008) model (Bowman et al. 2008(Bowman et al. , 2012. The resulting density data cube is propagated in time using driver forecasts supplied to JB2008. The forecasts are deterministic in nature. -2-

Confidential manuscript submitted to Space Weather
The JB2008 models neutral density in the thermosphere using global exospheric temperature equations that leverage four solar indices to simulate thermosphere heating from different sources of solar energy (Tobiska et al. 2008a;Bowman et al. 2008). The F 10.7 proxy has a strong correlation to solar extreme ultraviolet (EUV) irradiance which has led to its long-time use as measure of solar EUV energy. S 10.7 is an index indicative of activity of the integrated 26-34 nm bandpass solar chromospheric EUV emission, which penetrates to the middle thermosphere and is absorbed by atomic oxygen. The M 10.7 proxy is used as a measure of far ultraviolet (FUV) photospheric 160 nm Schumann-Runge Continuum emissions, which penetrate to the lower thermosphere and cause molecular oxygen dissociation. The fourth solar index is Y 10.7 which is a composite of X b10 and Lyman-alpha. This serves as a composite measure of solar coronal 0.1-0.8 nm X-ray emissions and 121.6 nm Lyman-alpha, both of which penetrate to the mesosphere and participate in water chemistry. In order to forecast these indices/proxies, SET uses a linear predictive algorithm that captures persistence and recurrence (Tobiska et al. 2008a).
To capture the impact of geomagnetic activity, the model uses a synthesis of ap and Dst indices. The ap index is a measure of global geomagnetic activity derived from twelve observatories that fall between 48 • N and 63 • S in latitude (McClain and Vallado 2001). The utilization of ap during quiet geomagnetic conditions results in low density errors, but Dst proves to be a more effective driver during storm times (Bowman et al. 2008). Dst is an index that represents the strength of the storm-time ring current in the inner-magnetosphere (Tobiska et al. 2008a). Its forecast is generated using SET's Anemomilos algorithm, which provides a forecast with maximum prediction window of six days (Tobiska et al. 2013 (Steenburgh et al. 2013;Singer 2013;Haiducek et al. 2017). This forecast only extends three days, so the value is set to zero for the last three days of each window. -3-

Confidential manuscript submitted to Space Weather
Even though SWPC only recently switched to using the Geospace Model, this data represents the official NOAA SWPC forecast and we use it as such.
Errors in the space weather driver forecasts cause errors in the resulting densities, therefore impairing satellite conjunction analyses. Bussy-Virat et al. (2018) recently performed a study to show the effects on driver uncertainty on the probability of collision between two space objects. A similar study was performed more recently by  incorporating additional forecasts and further conditioning distributions. The probabilistic F 10.7 forecasts in Figure 1 were generated using the statistical measures identified in the current study. There was a constraint of the maximum change in the driver (dF 10.7 ) from one time-step to the next. This limiting factor was chosen through further statistical analyses. Each driver forecast was input to a quasi-physical model of the mass density built using recurrent neural network to forecast a resulting 3D density grid that would be used in orbit propagation (Mehta et al. 2018;Mehta 2019, 2020). The satellite position distributions give light to the need for probabilistic approaches in determining satellites' orbits. In this fairly quiet case, there was a ∼6.4 km position error with deterministic approaches. Probabilistic forecasting allows for the true position to be captured through -4-Confidential manuscript submitted to Space Weather the analysis. Here, the mean probabilistic position was more accurate than the deterministic position. Figure 1 is a derivative of this work.
We expand upon the work of Bussy-Virat et al. (2018) by using (i) all solar and geomagnetic drivers that are used in operations, (ii) a large historical data set covering a period of six (6) years, (iii) an extended forecast window of up to six (6) days, and (iv) the initial driver values to characterize model performance as a function of the solar and geomagnetic activity.
The outline for the current paper is as follows: the following section introduces the techniques and thresholds to bin solar and geomagnetic drivers. This is done separately between the domains and presents distinct methods. Next, the resulting uncertainty figures are presented and discussed followed by the conclusion.

Methodology
The SET algorithms produce files every three hours generating updated six-day forecasts for solar and geomagnetic indices. These forecasts have a temporal resolution of three hours. In addition, they archive the observed values for each time step. To conduct this analysis, forecasts from October 2012 through the end of 2018 were used with the exception of some missing/corrupted forecasts. In total, there were over 15,000 files to leverage for this study.
In order to effectively examine the solar and geomagnetic indices in comparable terms, a consistent approach had to be determined. To provide the clearest possible representation for all indices, different methods are used for solar indices and geomagnetic indices but kept consistent within the domains. Each index was split into separate sub-populations depending on the initial forecasted value. Populations that ended up with fewer than 100 forecasts are not shown, because there is insufficient data to draw statistical conclusions.

Solar Indices
The task of generating statistical results for the four (4) solar indices investigated (F 10.7 , S 10.7 , M 10.7 , and Y 10.7 ) was relatively straightforward. The forecasts are generated using SET's SOLAR2000 algorithm (Tobiska et al. 2000(Tobiska et al. , 2008b. The thresholds to assess activity level were determined by the experience of previous work combined with a supplemen-  The thresholds for F 10.7 had been previously specified and used in previous work (Mehta 2013;. Using these partitions on the 15,000+ forecasts resulted in a distinct number of individual F 10.7 forecasts for each activity level. This was used to classify the remaining solar indices, with the absence of a natural partition. A natural partition within a distribution is seen at 150 sfu for S 10.7 . This was chosen for the particular threshold as it did not greatly disrupt the number of forecasts in the adjacent activity levels. The four levels of solar activity are defined in Table 1. -6-Confidential manuscript submitted to Space Weather that was more positive than the issued value. All of the solar indices are updated daily, there are twenty-four distributions for each (four magnitude-based and six temporal partitions).

Geomagnetic Indices
The analysis of the two geomagnetic indices, ap and Dst, was more intricate. Not only are the uncertainties functions of their magnitudes and time from epoch, they vary with solar activity level. To analyze ap, three geomagnetic activity levels were chosen: low, moderate and active. In analyzing Dst, six geomagnetic activity levels were chosen and are consistent with the NOAA G-scale as operationally applied by SET. Table 2 states the thresholds for ap and Dst.
To allocate the geomagnetic forecasts, the largest value in the forecast for ap and the most negative value for Dst is the controlling factor. In addition, the forecast is classified by the initial forecasted F 10.7 value. Since the distributions have a finer temporal resolution and a solar dependency, there are 1,152 distributions for ap and Dst.
-7-Confidential manuscript submitted to Space Weather It becomes difficult to generate a standard percent error normalized by the issued value, because the issued value can be small or even zero. Therefore, another method had to be chosen to provide a similar comparison. Instead of normalizing errors by the issued value, they are normalized by the long-term mean value of the index. Therefore, an error of −200% for Dst signifies an error twice the magnitude of the long-term mean Dst, and the prediction was more-negative than the issued value. The long-term mean values for ap and Dst are 9.2 2nT and -8.8 nT, respectively.

Results
In the resulting uncertainty figures, the mean and standard deviation of forecast error (as a function of time from current epoch) are presented for each activity level. This way, biases can be identified and the algorithm's temporal uncertainty can be determined. Figure   3 shows the performance of the F 10.7 forecast algorithm.  over-forecast in addition to a large uncertainty. The uncertainty also does not consistently grow with time.
The F 10.7 and S 10.7 algorithms are both vulnerable to high solar activity, but the comprehensive effectiveness is visible. The limitation during high activity is due to the volatility of the sun during solar maximum, i.e, the inability to accurately forecast flares and the lack of information from the solar East limb and solar far-side active region's growth. The algorithms for the remaining indices prove to be more robust to solar activity. The M 10.7 performance is presented in Figure 5.
For M 10.7 , there is a minimal bias of ±2% for the lower two activity levels, but at low solar activity, there is a slight tendency to under-predict. At elevated and high solar activity, the bias is accumulating with time and increases in intensity. Across all levels, the uncertainty starts below 4% and grows steadily with time. An interesting characteristic that con--9-Confidential manuscript submitted to Space Weather trasts the prior two indices is the lower uncertainty at high solar activity. The difference in performance is not drastic relative to the other conditions.
To conclude the analysis of the solar indices, Figure 6 shows the performance of the Y 10.7 algorithm. Relative to the previous three indices, the Y 10.7 algorithm is considerably robust to activity levels and has less overall uncertainty. In the first two activity levels, the bias is less than ±1% for nearly the entire prediction window. The uncertainty grows with time for all activity levels, but its magnitude is less significant than the other indices. The bias never exceeds 5% and the uncertainty 12%.
As previously stated, the geomagnetic indices were more difficult to analyze due to an increase in dependencies and a finer time resolution. Each geomagnetic index has its own set of activity levels but are both based on the previous F 10.7 thresholds. The performance of the ap forecasts is shown in Figure 7.
-10-Confidential manuscript submitted to Space Weather  During low geomagnetic activity (across all solar activity levels), there is no significant bias detected. With moderate geomagnetic activity, there is a general over-prediction that decreases over the three-day provided forecast. It shows a possible path for prediction improvement by relying on persistence when ap is high at the start of the forecasts. Another key determination is shown by the right-most panels where there is only a single forecast that has a value greater than 50 2nT. This reflects the difficulty in quantifying the intensity of a storm, even with the aid of a physics-based model.
The last algorithm analyzed is SET's Anemomilos for Dst forecasts, shown in Figure 8.
The G5 row is not shown. There was only a single forecast where a G5 storm was expected.
There are only 9/24 conditions with enough forecasts to perform the analysis, but the remaining results provide insight to the strengths and weaknesses of the algorithm.
-12-Confidential manuscript submitted to Space Weather In the top-left subplot (when conditions are quiet), the forecasts remain relatively unbiased, and the uncertainty slowly increases with time. Figure 8 shows a general tendency to predict Dst to be more positive for nearly all G0 and G1 conditions, with the exception of G1 low solar activity conditions. In this case, the algorithm has a strong bias to expect Dst to be ∼ 23nT more negative than the issued values over the first four days of the forecast. Following the strong inclination after day four, the algorithm tends to neutralize the bias. This is interpreted as accurate prediction of Dst recovery to quiet conditions.
-13-Confidential manuscript submitted to Space Weather The bias for G1-G3, moderate solar activity conditions shows a strong temporal dependency transitioning from under to over prediction in each case. G2 moderate solar activity is a case with a peculiar trend of the uncertainty decaying with time from epoch. This is also the case for G3 moderate solar activity. However, this case has extreme and unclear results with both the bias and uncertainty changing rapidly with an inverse relationship. This behavior points to a need for improvement in these conditions. A source of the Dst variability in G0-G3 conditions is the high-speed stream (HSS) and Anemomilos does not model these events.
The analysis of the SET algorithms used by the JB2008 and HASDM models provided clear performance capabilities for the current standard for density model driver forecasts.
This work showed the many strengths of these predictive algorithms while also showing conditions where improvements can be made. In general, the forecasting capability for solar indices at low and moderate activity levels has comparably low uncertainty and virtually no bias. This performance is degraded to an extent at elevated and especially high activity levels, where the sun is more volatile.
The best performing algorithm is for Y 10.7 whose forecasting method is the most complex of the four solar indices investigated. The algorithm for M 10.7 also has low uncertainty and low bias at the two lower solar activity levels. The forecasts for F 10.7 and S 10.7 prove to be more uncertain and with generally higher biases. Both indices had strong tendencies to over-predict at high solar activity.
The geomagnetic indices, ap and Dst, proved to be difficult to predict even using the two diverse methods. The forecasts for ap are determined by a team of forecasters with the aid of a model, and there was still a low probability of detection for geomagnetic storms. In most conditions however, there was little or no bias in the predictions. The three-day prediction window also ended up being a limitation, and results from a full six-day forecast would be intriguing. The Dst algorithm performed well during G0 (or quiet) conditions but showed unusual trends with increased geomagnetic activity.
A major limitation in this study was the lack of forecasts in certain conditions. This was particularly problematic for the geomagnetic indices, and using the most extreme index value to bin forecasts was used to offset this limitation. Even with this technique, a large percentage of conditions had insufficient data to perform the uncertainty analysis. In the future, we hope to include additional forecasts to the analysis to update the results in order to cover more conditions. This work is intended to provide the community with a performance level for future algorithm and model development in an effort to improve our capability to accurately forecast density and determine satellite trajectories.