Probabilistic Forecasts of Storm Sudden Commencements From Interplanetary Shocks Using Machine Learning

In this study we investigate the ability of several different machine learning models to provide probabilistic predictions as to whether interplanetary shocks observed upstream of the Earth at L1 will lead to immediate (Sudden Commencements, SCs) or longer lasting magnetospheric activity (Storm Sudden Commencements, SSCs). Four models are tested including linear (Logistic Regression), nonlinear (Naive Bayes and Gaussian Process), and ensemble (Random Forest) models and are shown to provide skillful and reliable forecasts of SCs with Brier Skill Scores (BSSs) of ∼0.3 and ROC scores >0.8. The most powerful predictive parameter is found to be the range in the interplanetary magnetic field. The models also produce skillful forecasts of SSCs, though with less reliability than was found for SCs. The BSSs and ROC scores returned are ∼0.21 and 0.82, respectively. The most important parameter for these predictions was found to be the minimum observed BZ. The simple parameterization of the shock was tested by including additional features related to magnetospheric indices and their changes during shock impact, resulting in moderate increases in reliability. Several parameters, such as velocity and density, may be able to be more accurately predicted at a longer lead time, for example, from heliospheric imagery. When the input was limited to the velocity and density the models were found to perform well at forecasting SSCs, though with lower reliability than previously (BSSs ∼ 0.16, ROC Scores ∼ 0.8), Finally, the models were tested with hypothetical extreme data beyond current observations, showing dramatically different extrapolations.


Introduction
Space weather events are ultimately driven by the interaction between the solar wind and the magnetosphere -ionosphere system. These interactions can be characterized as the storage and spontaneous release of energy, leading to intermittent, shorter space weather events (e.g., substorms Freeman et al., 2019), or by the driving of longer extreme space weather events by large-scale structures in the solar wind, such as Coronal Mass Ejections (CMEs) (Illing & Hundhausen, 1983;Gosling, 1993;Richardson & Cane, 2012). CMEs are huge eruptions of plasma from the Sun, exploding into the solar system, often driving fast-forward shocks ahead of them (see review by Webb & Howard, 2012). The impact of CMEs, and their associated interplanetary shocks, on the magnetosphere can drive dynamics and processes such as magnetospheric storms and substorms (e.g., Akasofu & Chao, 1980;Brueckner et al., 1998;Gonzalez et al., 1994;Kamide et al., 1998;Kokubun et al., 1977;Yue et al., 2010;Zhou & Tsurutani, 2001). CMEs associated with high speed interplanetary shocks and significant intervals of southward directed interplanetary magnetic fields (i.e., negative IMF Bz) have been found to be particularly geo-effective (Balan et al., 2014;Echer et al., 2008). Other phenomena can also drive interplanetary shocks, for example, corotating interaction regions (CIRs): structures produced when high speed solar wind interacts with slower moving solar wind in its path (see review by Crooker et al., 1999). However, CIR-driven shocks are not often associated with the same large negative magnetic fields and therefore may drive less severe but longer lasting magnetospheric activity (Alves et al., 2006;Borovsky & Denton, 2006).
On the ground, the impact of a significant interplanetary shock on the magnetosphere can often be inferred through a sharp increase in the northward component of the horizontal magnetic field, known as a Sudden Commencement (or SC) (Araki, 1994;Chree, 1925). If the shock impact is followed by a geomagnetic storm Space Weather 10.1029/2020SW002603 then it may be further termed a Storm Sudden Commencement (or SSC); if it is not then it can be classed as a Sudden Impulse (SI) (Curto et al., 2007;Joselyn & Tsurutani, 1990). It should be noted that the impact of a shock is not always guaranteed to cause large, measurable changes (i.e., an SC) in the magnetosphere (e.g., Echer & Gonzalez, 2004).
The initial impact of some interplanetary shocks and the induced sharp changes in ground magnetic field have been found to generate large currents in power networks, particularly at low latitude and midlatitude (Beland & Small, 2004;Carter et al., 2015;Kappenman, 2003;Marshall et al., 2012;Zhang et al., 2015). Meanwhile, the magnetospheric storms and substorms that may be caused by the shock, or the phenomena driving the shock, are also likely to be associated with large rates of change of the geomagnetic field (and therefore induced currents) (Dimmock et al., 2019;Freeman et al., 2019;Kappenman & Albertson, 1990;Kappenman, 1996;Ngwira et al., 2013;Pulkkinen et al., 2005Pulkkinen et al., , 2012, particularly at higher latitudes where the auroral current systems most often reside (Rogers et al., 2020). Recent work has shown that for the United Kingdom (at midlatitudes) over 90% of extreme geomagnetic field fluctuations occur within 3 days of an SSC (Smith et al., 2019). Interplanetary shocks therefore represent an important source of hazardous space weather, whether directly or in connection with the phenomena that drives the shock. Links have been observed between the properties of interplanetary shocks and the generated geomagnetic (Gonzalez et al., 1999;Oliveira & Raeder, 2015;Tsurutani et al., 1992;Tsurutani & Gonzalez, 1998) and auroral observations (Oliveira et al., 2016). With this in mind it is of great interest to be able to forecast intervals in which infrastructure is at risk, that is, to produce skillful models that are able to accurately forecast the geoeffectiveness of interplanetary shocks.
From an operational perspective, it is possible to automatically identify interplanetary shocks in spacecraft data, for example at L1 (e.g., Cash et al., 2014;Kruparova et al., 2013;Vorotnikov et al., 2008Vorotnikov et al., , 2011. At a minimum, the delay from L1 to the subsolar magnetopause provides appropriately 30-120 min of warning of the shock's arrival at Earth, though the precise delay is variable and not simple to predict (Cash et al., 2016). Further classification at the L1 point to determine the phenomena related to the shock, for example, a CME or CIR, is more difficult to automate but would likely aid forecasts (e.g., Echer et al., 2008). Recent work has had success forecasting whether an interplanetary shock will be followed by an extended interval of southward magnetic field (Salman et al., 2018), as these intervals are known to be particularly geo-effective (e.g., Gonzalez & Tsurutani, 1987).
In the last 20 years, studies have begun to leverage machine learning techniques to forecast space weather; the interested reader is directed to Camporeale (2019) for a detailed overview. In particular, machine learning techniques have been used to forecast geomagnetic indices (cf. Morley, 2020), for example the Kp index (e.g., Ji et al., 2013;Tan et al., 2018;Wang et al., 2017;Wing et al., 2005;Wintoft et al., 2017) and Dst/Sym-H indices (e.g., Bhaskar & Vichare, 2019;Chandorkar et al., 2017;Kugblenu et al., 1999;Lethy et al., 2018;Lundstedt et al., 2002;Wu & Lundstedt, 1996). In this work, we investigate the ability of machine learning methods to provide a probabilistic forecast as to whether an observed interplanetary shock will lead to an SC (e.g., a significant ground magnetic field signature), and further whether this will be followed by a geomagnetic storm (i.e., the shock is related to an SSC). We also examine the performance of these models when presented with inputs at and beyond the limits of the training dataset. As discussed above, the geoeffectiveness can likely be more thoroughly determined through inspection of the physical process driving the shock; however, operationally the details of this structure may not be quantified a priori. Therefore, we investigate whether an SSC can be predicted based solely on information about the shock.
The structure of the paper is as follows. In section 2 we discuss the data and catalogs of events employed, while in section 3 we discuss our methods, models, and metrics. Section 4 describes the results of the study, while section 5 discusses the results in terms of the most powerful predictive parameters, the effectiveness of the parameterization of the shock, and the performance of the models when presented with extreme events.

Data and Catalogs
For this work we require catalogs of interplanetary shocks (measured at L1) and sudden commencements (SCs), which are further classified into sudden impulses (SIs) and storm sudden commencements (SSCs). We note that any catalog may not be fully comprehensive, as marginal events may not be identified. However, it is reasonable to assume that the largest events will be present, and it is these extreme events that are likely

10.1029/2020SW002603
to be the most significant in terms of space weather. For this purpose, we have selected two independent catalogs, described below.
As a part of this study we wish to evaluate the most powerful predictive properties of solar wind shocks, and we therefore extract a wide variety of parameters from the observations of the shock and the surrounding solar wind. Specifically we record the maximum, minimum, range, and mean of the GSM components of the solar wind velocity (V) and interplanetary magnetic field (B), the magnitude of B, and the solar wind proton density, henceforth referred to as features. We do not consider properties that can be further calculated from such features, for example, dynamic pressure or the various coupling parameters, as they strongly correlate with the existing features. The properties of the interplanetary shocks were extracted from data obtained from the ACE and WIND spacecraft within the hour preceding shock arrival at the Earth (as inferred from an increase in the Sym-H index). The 1 hr window was determined empirically in order to include all shocks and account for the different propagation times between L1 and the subsolar magnetopause. Though this scheme will sample different quantities of pre-and post-shock plasma, depending on the speed of the shock, the extracted parameters are found to be dominated by the shock itself and any associated field rotation. Adjusting the interval of time sampled around the shock was not found to significantly change the results.
From ACE, data were used from the Solar Wind Electron, Proton and Alpha Monitor at 64 s resolution (McComas et al., 1998) and the Magnetic Field Experiment at 16 s resolution (Smith et al., 1998). Meanwhile from WIND, data were used from the Solar Wind Experiment at 92 s resolution (Ogilvie et al., 1995) and the Magnetic Fields Investigation at 1 min resolution (Lepping et al., 1995). We note that the data from ACE and WIND can be incomplete, particularly during more extreme solar wind conditions. As it is our goal to demonstrate a method that could be used for operational forecasting, our feature selection method has been chosen such that they may be extracted from incomplete data. Operationally, that is, from a forecasting perspective, it would be important to minimize the effect of any missing data on the predictions made. However, if the data quality were assured and continuous, then it would likely be of significant value to include more detailed parameters of the shock, such as the shock impact angle (Oliveira & Samsonov, 2018). It should also be noted that this study has been performed using science data (i.e., "Level 2 data"), any future operational form should be trained on the more immediately available data products to provide a more representative comparison.

Sudden Commencements
To evaluate the geo-effectiveness of the interplanetary shocks we use an independent database of 417 sudden commencements (SCs), for the same time interval as above, provided by the International Service of Rapid Magnetic Variations (part of the International Service of Geomagnetic Indices) based at Ebre Observatory (http://www.obsebre.es/en/rapid). These events are identified from the data collected by five low-latitude magnetic observatories (Curto et al., 2007). The catalog has been used in the past to statistically assess the consequences of SCs and related activity (e.g., Carter et al., 2015;Fiori et al., 2014;Smith et al., 2019). We further classify the SCs into Sudden Impulses (SI) or Storm Sudden Commencements (SSC) according to the scheme of Fiori et al. (2014), based on the Sym-H index in the days following the observation (cf. Gonzalez et al., 1994). In this scheme the events are classed as an SI if the Sym-H index is greater than −30 nT for the 48 hr following the SC, an SSC if the Sym-H index falls to less than 30 nT within 4 hr of the SC, and as a delayed SSC (SSCd) if the fall to below 30 nT occurs between 4 and 48 hr after the SC. It should be noted that these definitions do not include any recognition of SSCs that may be identified by the changing magnetic "rhythm" of the stations (Mayaud, 1973); however, it is simple and easily reproducible. Table 1 Cross-Comparison IP Shocks (Oliveira, Arel, et al. 2018)

Cross-Comparison
We can cross-compare the events in both catalogs to evaluate whether the interplanetary shocks identified at L1 caused a ground signature that was recognized as an SC, and whether any further activity was observed.
To do so, the SC database was checked for corresponding SCs within ±15 min of each shock impact. This choice of window size was confirmed to correctly match those that would be associated manually. Table 1 shows a comparison between the two catalogs.
First from Table 1 and the perspective of the shock observations, we can see that 240 of the 547 interplanetary shocks in the interval (44%) did not cause a significant ground signature, that is, an identified SC. This relatively large fraction is expected: An interplanetary shock impact is not a sufficient criterion for an SC to be observed (e.g., Echer & Gonzalez, 2004). Meanwhile, 94 out of 125 (75%) SSCs can be directly related to solar wind shocks. This is entirely consistent with the findings of a shorter time interval studied by Wang et al. (2006). From an similar perspective, while 94 interplanetary shocks (out of 547, 17%) directly correspond to SSCs, a total of 261 (out of 547, 48%) can be linked to some form of storm activity (i.e., SSCs or SSCds). This is similar to the observations of Echer and Gonzalez (2004), who found that between 1973 and 2000 57% of their 574 interplanetary shocks were followed by a Dst below −50 nT in the following 3 days.
Meanwhile from the perspective of SCs, 107 of the 414 (25%) SCs recorded during this interval do not correspond to an identified interplanetary shock at L1. This is very similar to the fraction found by Wang et al. (2006) in their study between 1995 and 2004; such events have been attributed to other phenomena in the solar wind (e.g., Park et al., 2015;Tsurutani et al., 2011). Additionally, it is possible that there are data gaps due to extreme plasma conditions at L1 which prohibit the identification of a counterpart shock. This would likely preferentially effect more extreme solar wind conditions. On the other hand, it is also possible that some of the SCs that lack an interplanetary counterpart could be due to the independent nature of the catalogs. It is often the case that when surveying data manually, smaller scale events can be more noticeable when the time period is highlighted by the occurrence of a related phenomenon. With regard to the SSC and SSCds that have an unidentified interplanetary cause, it should be noted that our statistics will include relatively small storms due to our adoption of a low threshold of 30 nT. Previous works have suggested a very strong association between large storms and interplanetary shocks (e.g., Chao & Lepping, 1974;Gosling et al., 1991). Interestingly, if we increase our storm threshold to −50 nT or even −100 nT we still return 22 − 25% of SSCs that do not correspond to an identified interplanetary shock. This suggests that a combination of the factors described above may be responsible.
Finally, we note that if we restrict the study to those events with sufficient in situ L1 data to provide the required features, the number of events reduces down to 93 (from 94) interplanetary shocks related to SSCs (with prompt storm activity within 4 hr) and a total of 303 (from 307) interplanetary shocks that are related to a detected ground signature (SCs). This marginally reduced catalog will form the database for the remainder of the study.

Methods, Models, and Metrics
In this work we demonstrate the use of machine learning models to provide probabilistic estimates as to whether observed interplanetary shocks at L1 will be geo-effective, defined in section 2 as causing an identifiable sudden commencement (SC), and also by whether they will be related to a geomagnetic storm (or SSC). In this section we outline the feature selection methods, the machine learning models and metrics used to evaluate the performance of the models.

Feature Selection
There are a large number of solar wind and IMF properties obtained at L1 that could possibly be used to describe an interval of data around an interplanetary shock. However, not all of these variables will be useful to the models; there will be some that correlate with each other, and some may be confounding variables (e.g., Bentley et al., 2018). A correlation matrix of the solar wind features is presented and discussed in Appendix A. Additionally, the inclusion of a large number of features may result in overfitting, where the model overlearns from the training examples and then is not able to extrapolate to future or unseen events. For this reason, we evaluate which features of the shock are the most useful to the models.
For this study we have chosen to extract the feature importances using a random forest (ensemble) model. The importance of each feature is a measure of how the Gini impurity or information (entropy) is changed by placing requirements upon that feature. It should be noted that this method can have unexpected results when scoring features that have large differences in scales (e.g., Strobl et al., 2007). To minimize the effect of this on our results, we have chosen to standardize the values of each feature using the mean and standard deviation. Additionally, it should be noted that if two features are correlated then the first selected will have the original importance, while the second will have a significantly reduced importance: Its subsequent addition will not significantly improve the forecast. This is despite the fact that both features may individually be good predictors, as would be expected if they correlate strongly. Therefore, an interpretation should be careful to note that the relative feature importance does not necessarily represent the individual skill of each feature in isolation, but instead the combination of features that best contribute to making classifications.
Two other feature evaluation methods were considered, the F Score and Mutual Information (MI) of the features. However, the F Score only measures linear relationships and does not account for correlations between features (Chen & Lin, 2006) and was therefore found to give poorer results. The MI method provided almost identical results to the ensemble feature importance method, but does not easily provide an uncertainty estimate, and therefore, it was not selected over the chosen method.

Models
In this study, we develop a series of forecasts using relatively simple machine learning techniques such that these forecasts can be run in near-real-time to forecast the consequences of an interplanetary shock that is observed at the L1 point. All models are available from the scikit-learn python package (Pedregosa et al., 2011). In section 5 we will also consider hypothetical or extreme interplanetary shocks, for example, historical events or those that may be predicted, perhaps through ballistic or MHD heliospheric models.
First, we consider a simple linear model. Fitting this model on a single feature would be the equivalent of scaling the probability of an "event" with the value of the parameter, providing a useful benchmark. The model is based on Logistic Regression, a linear technique where the probability of each outcome is modeled with a logistic function (or sigmoid curve) (Hosmer & Lemeshow, 2005).
We also consider two nonlinear models to examine whether the interplanetary shocks can be better evaluated in this way. First, we test a Gaussian Process classifier, a non parametric model that uses a Bayesian approach, assuming a prior distribution on the underlying probability densities (Rasmussen & Williams, 2006). It has the advantage that it natively provides a probabilistic result; however, it is known to be relatively computationally intensive when applied to high dimensional data sets. The second contrasting nonlinear method tested is a Gaussian Naive Bayes model, based on applying Bayes' theorem while assuming conditional interdependence between features, given the assigned class (Pérez et al., 2006;Webb et al., 2011). In this formulation, the likelihood of the features is assumed to be Gaussian. This model has been included as it has been shown to perform well as a classifier with a relatively small training set. However, the probabilities that it assigns are known to be unreliable (Zhang, 2004), and we therefore re-scale the returned probabilities in order to report better calibrated and reliable forecasts using Platt Scaling (Platt, 1999). Platt scaling involves fitting a parametric logistic regression model to the output of the model. Essentially, this scaling enables the correction of the initial returned probabilities into a more reliable output. The sklearn package provides this feature as a part of the CalibratedClassifierCV class.
Finally, we consider an ensemble model based on decision trees. Each tree divides the parameter space into "leaves," with each split designed to maximize separation of the distinct classes (Breiman et al., 1984). This method is highly nonlinear and is known to be susceptible to overfitting. Therefore, Random Forests average the results of multiple independently derived "trees," thereby reducing the variance of the model and overfitting (Breiman, 2001;Ho, 1995). Probability estimates are obtained from the ensemble of predictions made from the different trees. To calibrate the returned probabilities we once more perform Platt scaling (Platt, 1999).
The Logistic Regression and Random Forest models (in particular) have several hyper-parameters: internal model parameters used during training that can be tuned to provide superior model performance. In this work we apply a simple grid search method to optimize these parameters and improve model performance.
To create a completely optimal form of these models a more detailed optimization of these hyper-parameters could be performed.

Cross-Validation
We apply a k-fold cross-validation procedure to ensure that the model results are not specific to a selected set of training data and can be generalized. This procedure involves splitting the data into k groups; using k − 1 groups to train the model and the remaining group to validate the model results (Kohavi, 1995;Schaffer, 1993;Shao, 1993). Specifically, we apply a stratified k-fold (with four folds) to examine the results of the models. The stratification ensures that the classes are evenly represented in all folds.
More generally, it is good practice to test model performance on a portion of data that was initially withheld from the model (i.e., a validation set). However, in this case we report the results of the k-fold cross-validation due to a limited and imbalanced data set, particularly when considering SSCs. Partitioning a validation set in this case returned metrics that varied considerably between distinct model runs, due to the relatively poor quantity of data and random selection processes. Given our chosen method of reporting, the training and test data are not fully independent, and therefore, the uncertainty in the model metrics is likely to be underestimated.

Baseline and Metrics
In this study, we assess the performance of the models using two standard probabilistic forecast verification metrics: the Brier Skill Score and the ROC score. These metrics were developed for assessing terrestrial weather forecasts but have also been used in a variety of space weather contexts (Azari et al., 2018;Crown, 2012;Forsyth et al., 2020;Murray et al., 2017).
The Brier Score (BS) is a measure of the mean square of the probability error (Brier, 1950), calculated using Equation 1: where N is the number of observed events, p i is the forecast probability (between 0 and 1), and a i is the observation (1 = occurred, 0 = did not occur). The Brier Score is measured between zero (for a perfect forecast) and one (for a completely incorrect forecast). In this study we are interested in comparing the skill of the model as compared to a baseline prediction, and so we calculate the Brier Skill Score (BSS), which shows the improvement of the BS of the model compared to the BS obtained by a reference prediction (climatology, or overall/total probability for example), calculated using where BS Clim is the BS obtained by the climatology and BS Model is the BS obtained by the model under investigation. The BSS will be maximized for a perfect forecast and will equal 1, while negative values indicate that the forecast is worse than simply using the climatological forecast.
Reliability diagrams graphically describe the accuracy of a probabilistic forecast, displaying how well the forecast probability of an event corresponds to the actual chance of observing the event. The frequency of the observations is plotted against the frequency of the forecast probability. A perfectly reliable forecast would lie along the diagonal line of gradient unity; for example, a forecast of 30% would correspond to an observation of the event 30% of the time. The comparison between the diagonal and the reliability curve provides a measure of how reliable the model results are. However, it is worth mentioning that a consistently "unreliable" model can be re-calibrated once its reliability curve has been assessed, as described above (cf. Platt, 1999).
Finally, ROC curves describe the ability of a forecast to discriminate between events and non-events (the skill of the model), something that is not tested by the reliability diagram and the Brier Skill Scores. An ROC curve is a plot of the false alarm rate against the hit rate for a series of forecasts in which the parameter which determines a positive or negative forecast is varied (Swets, 1988). In this study, we set a threshold for the forecast probability and vary this to generate the ROC curve. In theory, as the probability threshold is increased a skillful forecast would show an increasing number of "hits," while the number of false positives would grow more slowly. This curve would lie close to the top left of the plot, such that the hit rate approaches 100% while the false positives remain low. To quantify this behavior the area under the curve may be evaluated, known as the ROC score, and is measured between 0 and 1 (Zweig & Campbell, 1993). A perfect forecast will be described by a point at the top left of the plot, maximizing the hits without incurring any false alarms, and return an ROC score of 1. In contrast, if the probability of an event is random then it would be expected that the false positives would grow at the same rate as the hits, and so a score of 0.5 corresponds to a model with low skill. Therefore, a skillful model will show a ROC score greater than 0.5, ideally approaching 1.

Results
In this work we apply the models described in section 3.2 to make probabilistic forecasts as to whether an interplanetary shock will be geoeffective. We define "geoeffective" in two ways, first, does the interplanetary shock generate an independently identifiable ground signature: a Sudden Commencement (SC). Second, does the interplanetary shock precede a geomagnetic storm, that is, a Sudden Storm Commencement (SSC).

Forecasting SCs
We first test the ability of the models to provide a probabilistic forecast as to whether an SC will be observed on the ground, using information derived from data in the interval around the interplanetary shock. As discussed in section 2, out of the 547 interplanetary shocks in the data set, 307 result in an independently identified ground signature. Excluding those events for which insufficient solar wind data are available limits the data set to 540 interplanetary shocks, 303 (56%) of which are related to SCs. Therefore, our climatological forecast is 56%, providing a Brier Score of 0.25.
As discussed in section 3.1, we use a random forest classifier to rank the importance of each of the 32 extracted solar wind features. Figure 1 shows the relative importance of the top 10 features, normalized to the most important, where the uncertainty is the standard deviation from the ensemble of estimators. The most powerful predictive parameter can be seen to be the range of the magnetic field magnitude, while the ranges in density and velocity (X GSM component) are the second and third. The top few features are shown to score highly, while the feature importance quickly drops beyond this. It should be noted that if the range of the solar wind dynamic pressure is included then this becomes the second most important parameter, with a similar importance to the range in density (n p ) shown in Figure 1.
We wish to optimize the number of features that we provide the models, we do this by adding each of the top 10 features, in order of their importance from Figure 1. Figure 2 shows how the addition of features changes the Brier Skill Scores and the ROC scores for each of the different models. All models initially show an increase in both metrics (representing reliability and skill respectively) following the addition of features, however the benefit of additional parameters is seen to plateau at around three features. Figure 3 shows the results of the models when provided with the top three features. Figure 3a shows a reliability diagram, presenting how the forecast probabilities correspond to the observations. As discussed in section 3.4, a perfect probabilistic forecast will lie along the diagonal dotted line of gradient unity. The horizontal and vertical dashed lines indicate where climatological forecasts would lie. All four models can be seen to fairly well correspond to the (diagonal) perfectly reliable forecast, and this is further quantified in Figure 2d, where the Brier Skill Scores (BSSs) are presented, compared to climatology. Also included is the BSS obtained by a simple linear 1-dimensional Logistic Regression model, equivalent to scaling the probability with the most important parameter: the range in magnetic field magnitude in this case. All four models return very good BSSs, outperforming both climatology and the 1-D Logistic Regression model, returning BSSs between 0.26 and 0.32. The Gaussian Process model is shown to provide the most reliable results, being the only model to consistently score above 0.3. For context, as the climatological Brier Score is found to be 0.25, a model BSS of 0.3 corresponds to a Brier Score of 0.175. Figure 3b analyzes the ROC curves, showing the false positive rate plotted as a function of true positive rate (described in section 3.4). The line of no skill is presented as a dashed diagonal line. All four models show ROC scores between 0.8 and 0.83, representing skillful forecasts. The worst performing models in terms of both BSS and ROC scores are found to be the Naive Bayes and Random Forest models, perhaps due to the specifics of their non-linear methods as well as the requirement (and only partial success) of the process to re-calibrate their probabilities (discussed in section 3.2). The Random Forest model in particular may be showing a tendency to over-fit to the training data.

Forecasting SSCs
Building on the above, we assess the ability of the selected models to provide a probabilistic forecast of whether an observed interplanetary shock will be followed by a geomagnetic storm, that is, a geomagnetic storm is observed within 4 hr of shock impact (cf. Fiori et al., 2014). The dataset for this totals 540 interplanetary shocks, of which 93 (17%) were followed by a geomagnetic storms. The climatological forecast is therefore 17%, which returns a baseline Brier Score of 0.14.
As above, we evaluate the 32 solar wind features extracted from the interval around the shock observation using a random forest classifier. The relative importance of the top 10 features is presented in Figure 4. While the range of the field (B Range ) still ranks highly, in contrast to Figure 1 we see that the most important parameter is now the minimum B Z observed in the interval around the shock. The change in velocity ranks as the third most important in forecasting whether a shock will be related to an SSC. In addition, if the range in dynamic pressure is also included, it ranks as the seventh most important parameter, though this should be noted with the caveat that it does correlate with other features ranked as highly as third most important.  Figure 5 shows how the addition of parameters in the order suggested by the random forest feature importances ( Figure 4) changes the BSS and ROC score. As with the forecasting of SCs above, the inclusion of a second feature increases the skill of the models; however, the addition of more features is not as beneficial, and the scores can be lower. The variations between the folds can be seen to be quite substantial (from the error bars), which is likely a result of the relatively small number of positive events with which it is possible to train the models. Figure 6 shows the results of the models when provided with the top 4 features, approximately maximizing the BSSs and ROC scores from Figure 5 for most models. The relatively small number of positive events can be clearly seen in Figure 6c, where the histogram of predicted probabilities is strongly dominated by low values. However, the reliability of the predictions, displayed in Figure 6a and quantitatively assessed in Figure 6d, can still be seen to clearly outperform climatology. The ROC plots and scores presented in Figure  6b also show good scores, above 0.8 for three of the models.
Interestingly in Figure 6d, and in contrast to the SC forecasting in Figure 3d, the addition of more parameters for several of the methods does not seem to provide a strong improvement over the use of a single parameter and a linear method (1-D LR). This could be due to higher dimensional models overfitting the relatively sparse data available. This is particularly the case for the highly nonlinear Random Forest model, which can be seen to give comparatively low BSS and ROC scores. However, it should be noted that in Figure 5 the Random Forest model does appear to benefit more than the other models from the inclusion of additional features. The best models in terms of both ROC scores and BSSs were the Linear Regression and Gaussian Process models with as little as two parameters, with ROC scores and BSSs exceeding 0.82 and 0.21, respectively (e.g., Figure 5). The climatological Brier score was 0.14 for this configuration, and so a BSS of 0.21 represents a Brier Score of 0.11.

Discussion
Our results show that simple machine learning techniques can provide models that skillfully and reliably forecast the occurrence of both SCs and SSCs from the properties of interplanetary shocks. However, it is also useful to consider the performance of these models when the available input data is limited or when the features used exceed the limits of the training dataset.

Most Powerful Predictive Parameters
Figures 1 and 4 evaluated the importance of each of the 32 extracted solar wind features around the shock identifications; we will now discuss and interpret the results in their physical context.

Sudden Commencements
The northward deflections of the horizontal geomagnetic field from which sudden commencements are identified are often modeled as a combination of two main components: a step-like function at low latitudes and a two-pulse structure that dominates at higher latitudes (Araki, 1977(Araki, , 1994. Observations at low latitudes have found that the magnitude of the low latitude perturbation scales with the square root of the change in solar wind dynamic pressure (e.g., Russell et al., 1992). The SCs used in this study were identified from a series of low latitude magnetic observatories, and it might therefore be expected that the most pow- erful predictive features would be the features that constitute the change in dynamic pressure: When the range in pressure is larger, it would be expected to generate a more significant ground signature that may be more likely to be identified. However, the ranges of density and velocity are only the second and third most powerful features in Figure 1. Indeed even when explicitly included, the range in dynamic pressure only ranks as the second most important parameter.
The most powerful feature is found to be the range in |B|. It is possible that the range in |B| may serve to distinguish between the phenomenon that is driving the shock. For example, it may be that CME-driven shocks more often display large changes in |B|, compared to those driven by CIRs, while the effect of the distinct phenomena at the Earth is known to vary (Oliveira & Samsonov 2018;Richter et al., 1985;Smith & Wolfe, 1976;Tsurutani et al., 2006). Another consideration is the combination of parameters that may be used to define a shock. In the frame of the shock, the ratio of upstream and downstream field, velocity, and density should provide the necessary information to evaluate the size of each shock (Hugoniot, 1887(Hugoniot, , 1889Rankine, 1870). As the values provided to the algorithms are not in the shock frame, it may be that the change in velocity is not as good a descriptor of the shock as the field and density changes. This may also explain why the addition of parameters beyond the first few is ineffective, once the shock is satisfactorily defined, little information of additional value can be included. From a physical perspective, however, the appearance of the range in B Z as the fourth most important parameter may indicate that it is effective in discriminating between perpendicular and parallel shocks, which may be important (e.g., Jurac et al., 2002). Nevertheless, the fact that the change in B ranks as the most important parameter highlights the importance of forecasting the nature of the magnetic field upstream of the Earth.
It is also notable that the ranges in the solar wind parameters are selected as important and not the maximum values. This may be explained by the close relationship between the range of a parameter and its ratio, which would define the size of the shock (e.g., Hugoniot, 1887Hugoniot, , 1889Rankine, 1870). Additionally, this could correspond to observations that the nature of the upstream solar wind into which the shock is propagating is also important (e.g., Liu et al., 2014;Riley et al., 1997), as the ranges of these parameters would effectively distinguish the greater geoeffectiveness of a shock propagating into a more tenuous region.

Storm Sudden Commencements
One of the main differences between the feature importances when forecasting SCs and SSCs (Figures 1  and 4) is the dominance of the minimum value of B Z when considering SSCs. It has long been thought that the southward component of the IMF is strongly associated with ground magnetic disturbances (e.g., Fairfield & Cahill, 1966). Notably, Echer et al. (2008) found that all 90 intense geomagnetic storms (Dst < −100 nT) within their dataset were associated with strong southward B Z . Therefore, it was expected that it would be a significant parameter. Additionally, strong correlations have been observed between the Dst index and B Z (e.g., Burton et al., 1975;Temerin & Li, 2002) or the related dawn-dusk electric field (e.g., Ontiveros & Gonzalez-Esparza, 2010;Wang et al., 2003). This work again highlights the importance of forecasting the orientation of the magnetic field.

Parameterization
The extraction of features from the solar wind around the shock identification is deliberately simple to minimize the effect of missing and incomplete data. However, it is likely that more complex fitting or parameter extraction would increase the effectiveness of the models. For example, recent work has shown the importance of the orientation of the shock front, both through the use of MHD models (Oliveira & Raeder, 2014) and also direct observations (Oliveira & Raeder, 2015).
While more complex characterization of the shock itself may help, obtaining parameters related to the phenomena driving the front may also aid prediction. At the longest lead time, remote observations of CMEs have been shown to be useful for inferring future consequences (Kim et al., 2010), especially when combined with empirical limits on conditions in near-Earth space (Kim et al., 2014). More detailed in situ analysis of CMEs, such as fitting of their structure, has also been shown to be useful (e.g., Kang et al., 2006). In addition, the properties of the sheath between the CME and the shock have been found to be important to determine the magnitude of the interaction with the Earth's magnetosphere (Kilpua et al., 2019). The sheath itself has been inferred to drive 25-50% of intense geomagnetic storms (Richardson et al., 2001;Tsurutani et al., 1988), through several mechanisms (Lugaz et al., 2016). It should be noted that the chosen window of data, 1 hr prior to shock impact, will include some sheath observations. However, with our method the most powerful predictive features are dominated by the ranges in solar wind parameters, and therefore by the shock itself (with some exceptions, for example, the minimum B Z ).
Correlations have been noted between the different regions of the interplanetary phenomena, which may aid the forecasting of SSCs using the methods in this study. For example, correlations have been observed between the properties of CME sheaths and their shocks (Lindsay et al., 1994;Tsurutani et al., 1988), as well as the shocks and the magnetic structure that follows (Lepping et al., 2001;Luhmann, 1997). Additionally, a 30-min interval around the shock has been suggested to be useful in predicting the occurrence of long duration southward B Z (Salman et al., 2018). Therefore, though information about the larger scale phenomenon driving the shock has not been included directly, it may correlate with the features provided.

Magnetospheric Information
While the results in Figure 6 show good skill at forecasting SSCs compared to climatology, there is not a large increase with the addition of parameters (i.e., Figure 5), perhaps as all relevant information about the shock has already been included. In fact, the Brier Skill Scores reported for the models mostly do not significantly improve on the benchmark of a 1-D Logistic Regression model (Figure 6d). There are several reasons additional factors that could explain this behavior. First, the extent of the catalogs may not be extensive enough to provide adequate coverage for multidimensional, nonlinear models. Second, it is possible that the models need to include some indication as to the state of the magnetosphere, as this may have a strong influence. Third, it is also possible that the method by which the features have been extracted from the solar wind around the shock impact are insufficient to quantify the nature of the coupling to the geomagnetic field, both initially during the shock and also for any structure that follows (e.g., coronal mass ejection, sheath region, etc.).
To investigate the second and third points, we can explore adding information about geomagnetic indices, providing an estimate of the current magnetospheric conditions and to see if the coupling is adequately captured with the properties extracted from the solar wind data. Similar to the treatment of the solar wind parameters, we extract the minimum, maximum, range, and mean of several magnetospheric indices: AU, AL, AE, Sym-H, and Sym-D in a 1 hr window prior to shock impact, extending to 15 min after the initial impact. This does mean that we are including any potential SC signature, and therefore, we only test the forecasts of SSCs. Figure 7 shows the top 10 parameters from the total of 52 possible features.
It is notable from Figure 7 that the top 4 parameters are all related to the magnetospheric indices, with most corresponding to the initial response of the magnetosphere (i.e., the range of several indices). This would suggest that the simple shock parameterization that has been employed has not captured some of the important information relating to the interaction of the shock and magnetosphere. However, as noted in section 3.1, the fact that the feature representing the minimum B Z is not present (compared with Figure  4) does not mean it is no longer important at all. It is likely that while the magnetospheric indices are more powerful predictive parameters, they correlate strongly with the minimum B Z and therefore reduce the returned importance of the B Z feature. Figure 8 shows the performance of the models, when provided with the top 5 parameters displayed in Figure  7. The top 5 have been used as this was the point at which no significant additional skill was obtained by adding more features. All four models are shown to once more provide reliable forecasts compared to climatology as shown by the reliability diagram and BSSs (Figures 8a and 8d). The models are also shown to provide good skill, with ROC scores above 0.8. The increase in reliability achieved through the addition of magnetospheric indices amounts to modest increase in BSSs scores of up to 0.05, while the increases in ROC scores are also relatively small at ≤0.02, and not present for all models considered. Therefore, we may conclude that the coupling of the interplanetary structure to the magnetosphere is not completely captured by the simple parameterization employed. The presence and significance of the minimum Sym-H index also suggests that the state of the magnetosphere and ring current is important. However, only a relatively small increase in skill can be achieved through addition of magnetospheric indices.

Longer Lead Times
When forecasting shocks and associated phenomena at large lead times, for example, from heliospheric imaging, there are certain parameters that may be more accurately predicted. For example, it may be possible to more accurately forecast the velocity (Barnard et al., 2019;Byrne et al., 2010;Davies et al., 2012Davies et al., , 2013Kahler & Webb, 2007) and density (Barnes, 2020) of a propagating CME, while its internal structure and magnetic field are much more challenging (e.g., Kilpua et al., 2019). Recent work has also confirmed that accurately forecasting CME velocity provides useful information about their geoeffectiveness, greatly increasing the value of such forecasts (Owens et al., 2020).
It is therefore useful to assess how the models perform when provided with a more limited dataset, corresponding to the parameters than may be predicted with a greater accuracy. For this test, we limit the input features to those related to the velocity and density observed at L1. Figure 9 shows the relative importances of each feature, evaluated once more with the ensemble method. Figure 9 shows that features associated with the velocity appear to provide the greatest predictive power.
However, Figure 10 shows the change in metrics with the addition of multiple features, and we can see that including more than one feature does not aid the majority of the models. For the ensemble (RF, Random Forest) model including up to five features increases the skill, but in general, both ROC and BSS metrics are lower than achieved for the other models. The Gaussian Process and Logistic Regression models give the best BSSs (∼0.16) and ROC scores (∼0.8) with around five features and outperform climatological forecasts. These scores are substantially lower than were achieved with the inclusion of parameters related to the field. This highlights the need to understand and be able to predict some of the magnetic field parameters associated with an incoming interplanetary shock and related structures (e.g., Gosling et al., 1991;Huttunen et al., 2005;Richardson & Cane, 2012;Zhang et al., 2007).

Extreme Events
The most damaging Space Weather events are those that are extreme; they pose the largest risk to infrastructure and have the potential to have the largest societal consequences. Arguably, the most extreme event recorded in history was the Carrington event in 1859 (Carrington, 1859;Hayakawa et al., 2019;Tsurutani et al., 2003). MHD modeling has suggested that the ground electric fields during this event would be twice as large as during the most extreme geomagnetic storm in the modern era (the March 1989 storm) (Ngwira et al., 2014). In the space age, the fastest CME ever recorded was observed in August 1972 (Hoffman et al., 1975), causing the unexpected detonation of sea mines during the Vietnam war (Knipp et al., 2018). More recently, an interplanetary CME inferred to be similar to the Carrington event, potentially providing a worse-case scenario, was fortunately not Earth-directed but was observed by the STEREO spacecraft in July 2012 (Baker et al., 2013;Ngwira et al., 2013). It is worth noting that the upstream conditions ahead of this CME were highly atypical (Russell et al., 2013). As such CMEs are so rare, and their parameters so unusual, it is interesting to consider how the models presented in this work would react to such an input.
To test this, we can explore how the models react to data provided above the range of the training data. We initially limit the models to the top two most effective features (shown in Figure 4) in order to simplify the visualization. The models are then provided with a grid of hypothetical events, ranging between the minimum of each feature and twice the maximum (calculated for −B Z minimum). Figure 11 shows the distribution of predictions made by the four different models. The orange contours show the distribution of training data corresponding to the shocks resulting in SSCs, while the red contours shows those that were not related to SSCs. The contours were created using kernel density estimation (Scott, 2015). The red contours representing "non-geoeffective" shocks can be seen to be mostly tightly bunched in the lower right of the panels. Meanwhile, the orange contours representing shocks related to SSCs can be seen to be broader and extend towards larger ranges in B.
The Logistic Regression and Naive Bayes models (Figures 11a and 11b) can be seen to asymptote to high probabilities outside of the training region. In contrast, the Gaussian Process and Random Forest models (Figures 11c and 11d) can be seen to have more structured behavior outside of this region, and both tend to be more conservative in their predictions. Without training data at the more extreme values it is not clear which behavior would correspond to better predictions, however, it is very important to take the models response into account when reporting predictions.
It is also notable that for very fast CMEs, for example, the August 1972 event (Hoffman et al., 1975), the transit time from L1 to the Earth's magnetopause will be relatively short. Therefore, the 1 hr window prior to impact will encompass a larger than normal portion of the CME sheath, perhaps effecting the derivation of the features used in this method. The consequences of this would have to be be carefully evaluated for any potential operational implementation.

Summary and Conclusions
We have developed four different models to forecast the consequences of interplanetary shocks on the magnetosphere using simple machine learning techniques. These forecasts perform well at predicting whether the shock will lead to an SC or SSC. We tested a linear model (Logistic Regression), two nonlinear models (Naive Bayes and Gaussian Process), and an ensemble model (Random Forest). The interplanetary shocks are simply parameterized by their maximum, minimum, range, and mean of observable solar wind and IMF properties at L1. This scheme was chosen as it is resilient to missing data and does not require manual or complex processing to be performed.
All four models provided skillful and reliable forecasts as to whether a shock would result in the identification of an SC on the ground, with Brier Skill Scores (BSSs) of between 0.26 and 0.32 and ROC scores between 0.8 and 0.83. This outperforms both climatology, which provided the reference to which the BSSs were calculated, and a simple 1-D linear model. Overall, the Gaussian Process model returned the greatest reliability, while both the Gaussian Process and Logistic Regression models showed the best skill. Meanwhile, the ensemble method showed indications that it may tend to over fit the available data. The most powerful predictive power was found to be the range in the magnetic field observed, followed by the ranges of density and velocity. This highlights the importance of forecasting the magnetic field upstream of the Earth. Physically it is possible that the range in the magnetic field serves to distinguish between the phenomena driving the shock, or that it provides a more robust definition of the shock than the change in velocity or density. The fact the ranges of each parameter were shown to provide greater performance (rather than the maximums, for example) confirms that the conditions into which the shock are propagating are important.
When forecasting whether a shock will be related to an SSC all four models provided skillful results (with ROC scores exceeding ∼0.78), while significantly outperformed climatology (with BSSs of ∼0.21). However, when presented with multiple features, the reliability of all four models were comparable to that obtained with a more simple one-dimensional Logistic Regression model (BSS ∼ 0.19). Once more, the Logistic Regression and Gaussian Process models were shown to provide slightly greater skill and reliability than the other two models, with the ensemble Random Forest requiring more parameters to achieve comparable performance. The most powerful predictive parameter for SSCs was shown to be the minimum value of B Z associated with the shock, confirming the importance of the orientation of the IMF when evaluating the effectiveness of a shock.
The chosen parameterization scheme was then tested by the inclusion of magnetospheric indices and their properties, describing the state of the magnetospheric system as well as the initial impact of the shock and sheath. These added features were found to give moderate improvements to the reliability of the models, suggesting that the simple parameterization does not fully capture the coupling of the solar wind and magnetosphere and that an indication of the magnetospheric state is important.
For longer lead times, properties of the magnetic field around interplanetary shocks are challenging to predict, we therefore tested excluding these features from the models results. This more limited data set resulted in decreased predictive performance. However, when solely relying on properties associated with the velocity and density, all models still outperform climatology and provide reliable and skillful predictions (BSSs ∼ 0.16, ROC scores of 0.8). Once more, the Logistic Regression and Gaussian Process models were shown to provide the best relative reliability and skill.
Finally, the models were provided with extreme hypothetical data with conditions beyond that with which they were trained. The responses of the different models were shown and contrasted. Both the Logistic Regression and Naive Bayes models were shown to quickly asymptote to high probabilities outside of the training data, while the Gaussian Process and Random Forest showed much more structured behavior that was heavily dependent on the data at the edge of the training space. The extrapolation of each model therefore needs to be carefully considered when applying such models to extreme data.