Solar wind prediction using deep learning

Emanating from the base of the Sun's corona, the solar wind fills the interplanetary medium with a magnetized stream of charged particles whose interaction with the Earth's magnetosphere has space-weather consequences such as geomagnetic storms. Accurately predicting the solar wind through measurements of the spatio-temporally evolving conditions in the solar atmosphere is important but remains an unsolved problem in heliophysics and space-weather research. In this work, we use deep learning for prediction of solar wind (SW) properties. We use Extreme Ultraviolet images of the solar corona from space-based observations to predict the SW speed from the NASA OMNIWEB dataset, measured at Lagrangian point 1 (L1). We evaluate our model against autoregressive and naive models, and find that our model outperforms the benchmark models, obtaining a best-fit correlation of 0.55 $\pm$ 0.03 with the observed data. Upon visualizing and investigating how the model uses data to make predictions, we find higher activation at the coronal holes for fast wind prediction ($\approx$ 3 to 4 days prior to prediction), and at the active regions for slow wind prediction. These trends closely mirror the roles of these regions as potential sources of fast and slow wind, as reported in the literature. This suggests that our model was able to learn some of the salient associations between coronal and solar wind structure without built-in physics knowledge. Such an approach may help us discover hitherto unknown relationships in heliophysics datasets.


Introduction
Space weather is defined by the U.S. National Space Weather Plan as the conditions on the sun, in the solar wind (SW), and within Earth's magnetosphere, ionosphere and thermosphere that can influence the performance and reliability of space-borne and ground-based technological systems and can endanger human life or health (NASA, 2017). The solar wind that influences space weather comprises a continuous stream of magnetized charged particles emanating from the base of the solar corona and permeating the solar system (Schwenn, 2006). The influence of the SW on space weather arises due to its interaction with the Earth's magnetosphere, resulting in geomagnetic storms and aurorae. Such terrestrial effects of the SW make it extremely important to understand and identify its possible sources, and to predict it as a forewarning against potential damage. Owens et al. (2008) review SW prediction using empirical, physics-based, and hybrid approaches. Typically, a physics-based model uses synoptic magnetograms as the bottom boundary condition. An individual synoptic magnetogram is assembled by sampling the photospheric magnetic flux distribution near the central meridian over the course of a solar rotation. Such magnetograms can be used to extrapolate the surface field into the corona using potential-field source-surface (PFSS) models (Altschuler & Newkirk, 1969) or magnetohydrodynamics (MHD) models (see Riley et al., 2006, for a comparison between the two). The global coronal magnetic field (or certain derived properties thereof) may then be used as input for physics-based SW propagation models (e.g. Linker et al., 1999), or in the case of a hybrid approach, used for estimation of the SW at L1 using empirical relations. WSA-ENLIL and MAS-ENLIL (Owens et al., 2008; Schwenn, 2006) are among the most widely used SW models. The models provide SW properties such as the velocity, plasma density, magnetic field and temperature. Jian et al. 
(2015) compare the various SW models and report a best correlation of 0.57 on hourly predictions using the GONG-MAS Thermo-ENLIL model, and 0.50 on the same dataset using the GONG-WSA-ENLIL model.
The dependence of high-speed streams on the presence of Coronal Holes (CHs) was first shown in Krieger et al. (1973), using extrapolation methods. Krieger et al. (1973) interpreted this empirical relation as a manifestation of high-speed SW flowing along open magnetic flux regions. Wang and Sheeley Jr (1990) identified an inverse correlation between the flux-tube expansion factor and the SW speed, and used the inverse of the expansion factor for SW prediction. Recently, there has been a surge in performing regression using fractional CH area extracted from EUV imagery data (Rotter et al., 2012; Rotter et al., 2015; Temmer, Manuela et al., 2018), and correlations from ≈ 0.60 to ≈ 0.78 have been obtained between the CH areas and hourly solar wind speed. More recently, Yang et al. (2018) devised a neural-network-based prediction scheme, taking PFSS model output among other parameters as input, and obtained a correlation of 0.74 on hourly solar wind speed data.
In machine learning (ML) and statistical-learning parlance, the aforementioned traditional empirical models use so-called hand-engineered features as input for their models (e.g. CH area or CH expansion factor). These hand-engineered features are often inspired by some insights from physics-based models, or simply from correlations reported in the literature. Deep learning (DL) is an umbrella term for a broad class of techniques that use neural networks with multiple hidden layers for performing supervised or unsupervised learning tasks. Due to the increased availability of data, and perhaps more importantly, of inexpensive computation, DL has been widely applied in many domains. In areas of engineering and science dealing with ambiguous features, DL has been found to outperform ML algorithms that use hand-engineered features (Goodfellow et al., 2016). Generally, DL involves solving supervised learning tasks such as classification (associating a particular label with a given set of inputs, such as in image classification) or regression (continuous-value prediction). Often, these algorithms need no prior information regarding the exact input-output mapping, but instead try to discover underlying relations by iteratively updating the model parameters by minimizing a loss function (e.g. mean-squared error). Prior information can be built into the model in a number of ways, e.g. (1) by providing hand-engineered input features (which are generally physics-based), (2) assembling a neural net with some layers that have been pre-trained and whose weights are kept fixed during training (known as transfer learning, although the amount of prior information shared is limited by the kind of pretraining performed), and (3) specially customized initialization of weights (usually to accelerate convergence).
Two of the most prominent architectures used in deep learning are Convolutional Neural Networks (ConvNets) and Recurrent Neural Networks (RNNs). ConvNets are a set of deep neural nets which have been successfully applied to different kinds of classification and regression problems on image data (Ciresan et al., 2011; L. Deng & Yu, 2014; LeCun et al., 2015). These networks work by detecting local patterns at multiple scales in the input, and map them to the appropriate class (classification) or continuous output (regression). RNNs, on the other hand, are a class of deep neural nets designed for understanding the structure of data with a sequential ordering. These have been used extensively for text prediction, natural language processing and regression (Hochreiter & Schmidhuber, 1997; Sutskever et al., 2014).
In this work, we use a ConvNet (Szegedy et al., 2015) pre-trained on the ImageNet database (J. Deng et al., 2009), and couple it with a trainable Long Short-Term Memory (LSTM) implementation of an RNN (Hochreiter & Schmidhuber, 1997). We use EUV images in 193Å and 211Å from the Atmospheric Imaging Assembly (AIA) onboard SDO as input, in which features like CHs and ARs are clearly visible, and predict the SW speed present in the NASA OMNIWEB dataset. The network is not given any prior information about the physical mapping between the EUV image data and SW speed, and a direct regression is performed from a time series of AIA images to the SW speed. This work is presented as follows: In Sec. 2 we describe how the input and output data are preprocessed and partitioned into training and test sets, and define some control parameters and evaluation metrics. Then, in Sec. 3, we briefly introduce the various algorithms used as our benchmarks, detail our proposed model WindNet, and describe the visualization technique used for generating activation maps. The segmentation algorithms used to generate the binary masks for the computation of mean activation values are also described. In Sec. 4, we summarize our model predictions vis-a-vis our benchmarks and present the trends of mean activation, and we draw conclusions in Sec. 5.

EUV dataset
The AIA (Lemen et al., 2012) onboard the SDO (Pesnell et al., 2012) is a four-telescope array observing the full Sun in visible, UV and EUV wavelengths. An ML dataset curated from SDO instruments (hereafter, SDOML) has been made publicly available (Galvez et al., 2019) (https://purl.stanford.edu/jc488jb7715 and links therein). AIA images in the dataset have been resampled onto a grid of 512×512 pixels with 4.8 arcsec pixel spacing and are available at 6 min cadence. SDOML images are stored as binary arrays in the Python numpy format (Walt et al., 2011). Our training and testing data include the AIA image taken at 00:00 UTC on each day; this image serves as a proxy for the whole day of observation. If the image at 00:00 does not exist (as is the case for many days), the image closest to 00:00 from that day is taken as the proxy instead.
Even during non-flaring times, solar EUV images can have a dynamic range that greatly exceeds the 8 bits per channel typical of most computer-vision datasets. For this reason, the input AIA images are first log-scaled to bring out fainter features, and then passed through a threshold and saturation, which limits the dynamic range of pixel values. This restricts the prediction to contributions from the solar disc alone; moreover, model performance was found to be better with thresholded and saturated images. A coarse sweep of threshold and saturation values was performed for a particular combination of history and delay (Sec. 2.4), on the 193Å data. The correlation of predicted with observed SW speed was 0.48±0.03 for the best set, with higher thresholds (log(250), log(10000)) giving 0.46±0.02 and lower thresholds (log(100), log(1000)) giving 0.35±0.02. The thresholds for the 193Å data were scaled to 211Å using the ratio of maximum intensities on a randomly selected day. Equations (1) and (2) specify the threshold and saturation operations for the log-scaled 193Å and 211Å channel images, respectively. Preprocessed AIA 193Å and 211Å data, with CHs and ARs marked, are shown in Fig. 1.
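The log-scaling, threshold and saturation steps can be sketched as follows (a minimal numpy sketch; the `lo`/`hi` values here are illustrative placeholders, not the paper's exact thresholds):

```python
import numpy as np

def preprocess_aia(img, lo=np.log(150.0), hi=np.log(5000.0)):
    """Log-scale an AIA image and clip it to [lo, hi] (threshold and
    saturation), then rescale to 0-255 for the pre-trained feature
    extractor. lo/hi are illustrative, not the paper's tuned values."""
    x = np.log(np.clip(img, 1e-3, None))  # log-scaling brings out faint features
    x = np.clip(x, lo, hi)                # threshold (lo) and saturation (hi)
    return 255.0 * (x - lo) / (hi - lo)   # rescale to the [0, 255] range
```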
The pixel values are then rescaled to lie between 0.0 and 255.0, since our feature extractor expects inputs within this range of values.

SW dataset
The target output of the prediction models is the daily-averaged SW speed measured at L1. Daily averages are used here since the variation of wind speed over a day is not large, and the spread about the mean value sets the uncertainty in the wind speed. This variation (the standard deviation σ) is computed from the hourly measurements over each day, as reported in the OMNIWEB archive (available online at: https://omniweb.gsfc.nasa.gov/form/dx1.html). A representative variation in SW speed data over 10 days is plotted in Fig. 2. The distribution of SW speed and the corresponding σ for the entire dataset is shown in Fig. 3.
There are gaps in the AIA EUV data (30 days of missing data in 211Å and 31 days of missing data in 193Å) for 00:00 UTC, owing to various reasons ranging from calibration maneuvers to recoveries from instrument anomalies. The SW speed during these gaps has therefore been removed, so that the data form matched {image, wind speed} pairs.

Dataset partitioning and Cross Validation
Data are available from 1 January 2011 to 9 December 2018. Given the background solar cycle and events on the Sun that might systematically bias our model to perform well only for a particular phase of the cycle, the whole dataset was partitioned into batches comprising 20 contiguous days of data. If a single discontinuity occurs during batch formation, the 20 days ending on the day prior to the discontinuity are sampled and placed in the same position as the previous batch, thereby removing any data leak. If there exist multiple discontinuities (there are only 2 instances of such an event in either of the datasets), the window between the discontinuities is discarded. This results in 157 batches for 211Å and 158 batches for 193Å data (the one extra missing day results in an additional batch). These batches were randomly sorted into 5 folds with equal probability, and these 5 folds were used to perform cross validation. The dataset partitioning scheme is shown in Fig. 4. In cross validation with folds [1, N], a cross-validation set is constructed by holding out fold i as the test set and using the remaining folds as the training set; such a construction is done for all folds. Our models are evaluated against this cross-validation dataset, providing a mean value of each metric and a standard deviation. Henceforth, any standard deviation associated with the predictions is to be taken as evaluated on the cross-validation dataset.
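The batch-to-fold assignment can be sketched as below (an illustrative sketch; the handling of discontinuities during batch construction is omitted):

```python
import numpy as np

def make_folds(n_batches, n_folds=5, seed=0):
    """Randomly assign contiguous 20-day batches to folds with equal
    probability; holding out one fold at a time yields the
    cross-validation splits described in the text."""
    rng = np.random.default_rng(seed)
    fold_of = rng.integers(0, n_folds, size=n_batches)
    splits = []
    for i in range(n_folds):
        test = np.where(fold_of == i)[0]   # held-out fold
        train = np.where(fold_of != i)[0]  # all remaining batches
        splits.append((train, test))
    return splits
```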
The image data are centered using the mean pixel value of the training dataset of each cross-validation fold. The images are resized to 224×224 pixels using OpenCV's (Bradski, 2000) default linear interpolation, and each image is replicated into 3 RGB channels. This is done because our pre-trained network expects input images of dimensions 224×224×3, since terrestrial images generally use red, green and blue as the color basis. These images are then used for training our network. The solar wind speed data are scaled between 0 and 1 using the training-data statistics (max and min values) of each cross-validation fold.
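The channel replication and leak-free target scaling can be sketched as (a minimal numpy sketch; the OpenCV resize step is omitted):

```python
import numpy as np

def to_rgb(img224):
    """Replicate a single-channel 224x224 image into 3 identical channels,
    as required by the ImageNet-pretrained feature extractor."""
    return np.repeat(img224[..., np.newaxis], 3, axis=-1)

def minmax_scale(y, y_train):
    """Scale SW speeds to [0, 1] using training-fold statistics only,
    so no test-set information leaks into the scaling."""
    lo, hi = y_train.min(), y_train.max()
    return (y - lo) / (hi - lo)
```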

Control Hyperparameters
Hyperparameters are free parameters which give a handle on controlling the whole algorithm. We define two control hyperparameters: history H - the number of days of input data required for one prediction - and delay D - the time from the latest input datapoint to the day of SW prediction. For example, if the day of prediction is T, and data from T-3 to T-6 are used as input, the history is 4 and the delay is 3. We have trained models with different combinations of delay (D = 1 to 4) and history (H = 1 to 4), resulting in 16 variants of the WindNet model. The meaning of the two control hyperparameters is illustrated in Fig. 5.
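The two hyperparameters fix which days of imagery feed a given prediction; a tiny sketch of that indexing (the function name is ours, for illustration):

```python
def input_days(T, history, delay):
    """Days of image data used to predict the SW speed on day T.
    With history=4 and delay=3 this returns [T-6, T-5, T-4, T-3],
    matching the example in the text."""
    return [T - delay - h for h in range(history)][::-1]
```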

Metrics for Comparison
Quantitatively, a set of metrics needs to be defined to unambiguously quantify whether a fit is good or bad. We define three metrics to estimate the goodness of fit (ŷ = Prediction, y = Observation):
1. Mean square error ($\chi^2$ value): $\chi^2 = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2$, where N is the number of datapoints. However, we present the Root Mean Square Error (RMSE), defined as $\sqrt{\chi^2}$, in units of km/s.
2. Reduced mean square error ($\chi^2_{red}$): $\chi^2_{red} = \frac{1}{N}\sum_{i=1}^{N}\frac{(\hat{y}_i - y_i)^2}{\sigma_i^2}$, where $\sigma_i$ denotes the standard deviation associated with each observation $y_i$. The standard deviation is computed over OMNI measurements for each day, and reported in the dataset.
3. Pearson correlation coefficient (r): the standard definition is $r = \frac{\sum_{i=1}^{N}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{N\,\sigma_y\,\sigma_{\hat{y}}}$, where $\bar{y}$ and $\bar{\hat{y}}$ represent the mean values and $\sigma_y$, $\sigma_{\hat{y}}$ the standard deviations of the datasets in consideration. To average the correlation across all folds, we transform the values to Fisher's z-space, perform the averaging, and then transform back, which prevents bias in the average (Corey et al., 1998). The standard deviations are likewise calculated in Fisher's z-space and propagated back to correlation.
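The three metrics, including the Fisher-z averaging of per-fold correlations, can be computed as in this short numpy sketch:

```python
import numpy as np

def rmse(yhat, y):
    """Root mean square error, in the units of the (unscaled) data."""
    return np.sqrt(np.mean((yhat - y) ** 2))

def chi2_red(yhat, y, sigma):
    """Reduced mean square error: squared residuals weighted by the
    per-day observational standard deviation sigma."""
    return np.mean(((yhat - y) / sigma) ** 2)

def mean_correlation(rs):
    """Average per-fold Pearson correlations via Fisher's z-transform
    (arctanh), averaging in z-space and mapping back with tanh."""
    return np.tanh(np.mean(np.arctanh(np.asarray(rs))))
```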
The three metrics defined have their own advantages and drawbacks.
1. The predicted data are scaled between 0 and 1. Hence, even a large deviation, when squared, appears very small if both the prediction and observation are < 1. Thus, while $\chi^2$ is a good minimization objective for training, it is a poor metric of goodness of fit.
2. To counter this, the $\chi^2_{red}$ metric is used. It takes into account the inherent error in each measurement and scales the fit accordingly. A bad fit to a high-error datapoint is acceptable, as the observation itself is highly uncertain, while a bad prediction on a low-error datapoint is penalized, since that observation serves as a much better point of comparison.
3. If the output were the naive mean value of the batch, the $\chi^2$ and $\chi^2_{red}$ would still be reasonable; however, there would be no variance in the fit. Hence, the Pearson correlation r is used to assess the trend captured by the fitted curve. The exact fit values may not match, but if the trend is captured, the model is fairly good according to this metric.
Summarizing, Pearson r captures the trend but ignores any scaling error; $\chi^2$ captures scaling errors, but performs poorly on scaled data < 1; and $\chi^2_{red}$ captures the errors by weighing them against the variance of the observed data. These three metrics are used for comparing the models, i.e., the r value of our proposed model should be higher, and the $\chi^2$ and $\chi^2_{red}$ lower, than those of the benchmark models. Please note that the errors (or spreads) reported later (for both the metrics and the activation evaluation) correspond to the Standard Error (the uncertainty in the estimated mean), defined as $\mathrm{SE} = \sigma(x)/\sqrt{N(x)}$, where σ(x) is the standard deviation derived from the sample and N(x) is the number of samples in the set. Model performance that accounts for timing errors is an important marker of the ability to capture the response to dynamic events in the SW. Thus, we also compare the performance of our models through their ability to capture High Speed Enhancements (HSEs), as used in several texts (Owens et al., 2005; Reiss et al., 2016; Bu et al., 2019). We use the method outlined in Jian et al. (2015) for identifying HSEs. This is performed as follows:
1. Mark all time points that are more than 50 km/s faster than 1 day earlier.
2. Group each contiguous block of marked points as a distinct high speed enhancement (HSE) and find the start and end time of each HSE.
3. For each HSE, find the minimum speed from 2 days before the HSE to its start and mark it as the minimum speed (Vmin) of the HSE; find the maximum speed from the beginning of the HSE through 1 day after the HSE and mark it as the maximum speed (Vmax) of the HSE.
4. For each HSE, find the last time Vmin is reached and the first time Vmax is reached, and mark them as the start and end times of a stream interaction region (SIR).
5. For the regrouped SIRs, find the Vmin and Vmax of each SIR and mark the last time of the highest speed gradient as the stream interface (SI), the boundary between slow and fast wind. Eliminate SIRs with redundant SI times.
6. Reject any SIR with Vmin faster than 500 km/s, Vmax slower than 400 km/s, or a speed increase of less than 100 km/s.
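The first two steps of this procedure can be sketched as below (a partial, illustrative sketch assuming daily-cadence data; steps 3-6 involving Vmin, Vmax and SIR regrouping are omitted):

```python
import numpy as np

def mark_hse_blocks(speed):
    """Steps 1-2 of the Jian et al. (2015) HSE identification, as a sketch:
    mark points more than 50 km/s faster than 1 day earlier, then group
    contiguous marked points into candidate HSEs as (start, end) index
    pairs. Assumes one sample per day."""
    marked = np.zeros(len(speed), dtype=bool)
    marked[1:] = (speed[1:] - speed[:-1]) > 50.0
    blocks, start = [], None
    for i, m in enumerate(marked):
        if m and start is None:
            start = i
        elif not m and start is not None:
            blocks.append((start, i - 1))
            start = None
    if start is not None:
        blocks.append((start, len(speed) - 1))
    return blocks
```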
Threat score (TS) is a proxy for the forecast accuracy of a model: with TP the number of correctly predicted HSEs, FP the number of spurious predicted HSEs and FN the number of missed HSEs, TS = TP/(TP + FP + FN). A model which predicts all the HSEs perfectly (while not predicting any spurious HSEs) has a TS of 1; thus, the lower the TS, the worse the model. For every cross-validation set per model, the HSEs are identified and the TS calculated, giving a mean TS and its uncertainty per model. Note that if an HSE (i.e., the peak of the enhancement) occurs very near a batch boundary, it is missed by the algorithm due to our data partitioning scheme. Such HSEs are discarded by benchmarking the H=1, D=1 persistence model to give a TS of 1.0. This study does not account for the effect of ICMEs (near-Earth Interplanetary Coronal Mass Ejections). There are 170 ICMEs reported within the time range considered in this study, affecting solar wind measurements on 336 days. In both model training and evaluation, we did not remove the days on which there were ICMEs. The prediction of solar eruptions leading to CMEs and ICMEs is outside the scope of this study; nevertheless, their occurrence impacts the solar wind measurements at L1, so for the evaluation of the solar wind models in this paper we decided to include even the days when ICMEs were present.

Benchmark Models
We next describe various models taken as benchmarks for our proposed WindNet model. These benchmark models all operate as autoregressive models on the SW data only and do not use AIA images as input. The models (except 27-day persistence) are all corrected for the data gaps, thereby making the comparison reasonable.

Autoregression using a 'Mean value'
One of the most basic benchmarks for any model is the comparison of the fit with a mean value model. This benchmark takes in the SW data and outputs the mean value of the whole batch. This model serves as the lowest benchmark that the proposed model should surpass, since untrained models output mean values.

Persistence Model
The second benchmark model is persistence: the SW speed observed at an earlier time is simply carried forward as the prediction. Such a model shows how long the SW signal persists through time.
The N-day persistence prediction is taken from H + D − 1 days prior to the day of prediction. As such, the persistence model has no individual dependence on H or D; it depends only on their combined value, and is thus degenerate across (H, D) pairs with the same sum. This model is primarily used to determine how far into the future our models consistently give a good prediction, given an observation today or observations starting today. We also benchmark our results against 27-day persistence, corresponding to one Carrington rotation, as it has been shown to be a good benchmark model (Owens et al., 2013). The 27-day persistence model operates on the complete SW dataset (devoid of any gaps).
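The two simplest benchmarks can be sketched in a few lines (an illustrative numpy sketch; function names are ours):

```python
import numpy as np

def persistence_forecast(speed, lag):
    """N-day persistence: the prediction for day t is the observed speed
    `lag` days earlier (lag = H + D - 1 in the notation of the text).
    Returns predictions aligned with the targets speed[lag:]."""
    return speed[:-lag]

def mean_forecast(speed_batch):
    """Naive mean benchmark: predict the batch-mean speed for every day."""
    return np.full_like(speed_batch, speed_batch.mean())
```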

Autoregression using XGBoost
The SW speed is autoregressed for different H and D using the XGBoost algorithm (Chen & Guestrin, 2016). That is, the prediction $\hat{y}_T$ is given as $\hat{y}_T = f(x)$, where the model input is $x = (y_{T-H-D+1}, y_{T-H-D+2}, \ldots, y_{T-D})$ and the function $f(\cdot)$ comprises the gradient-boosted decision trees. The various parameters set for the algorithm are shown in Table 1. The best model from the swept set of parameters is selected based on the lowest $\chi^2$ value.
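The lag-matrix construction at the heart of this autoregression can be sketched as follows (a pure-numpy sketch; any regressor, XGBoost in the paper's case, can then be fit on the resulting (X, targets) pair):

```python
import numpy as np

def make_lag_matrix(y, history, delay):
    """Build the autoregressive design matrix: the row for day t holds
    (y[t-H-D+1], ..., y[t-D]) and the corresponding target is y[t]."""
    start = history + delay - 1
    X = np.stack([y[t - delay - history + 1 : t - delay + 1]
                  for t in range(start, len(y))])
    return X, y[start:]
```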

Autoregression using support vector machines (SVMs)
SVM is also used as a benchmark, since it offers more non-linearity than decision trees through its kernels. Three kernels are used for benchmarking: the radial basis function (RBF), linear, and polynomial (degree 5) kernels. We use the Scikit-learn (Pedregosa et al., 2011) implementation of SVM in this work. The parameters were selected by grid search using the $\chi^2$ value as the comparison metric. The best-fitting parameters are shown in Table 2.
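The grid search over SVM parameters can be sketched with scikit-learn's `SVR` and `GridSearchCV` (an illustrative sketch on synthetic data; the parameter grid here is hypothetical, not the paper's Table 2 values):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Hypothetical parameter grid for illustration only.
param_grid = {"kernel": ["rbf", "linear"], "C": [0.1, 1.0, 10.0]}
search = GridSearchCV(SVR(), param_grid,
                      scoring="neg_mean_squared_error", cv=3)

# Synthetic autoregression problem: 4-day lag inputs, next-day target.
rng = np.random.default_rng(0)
y = np.sin(np.linspace(0, 20, 200)) + 0.1 * rng.standard_normal(200)
X = np.stack([y[t - 4:t] for t in range(4, len(y))])
search.fit(X, y[4:])
```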

Proposed SW model
The methodology followed here is to define a feature extractor, which reduces the dimensionality of the AIA images into a set of generic features, and a regressor, which takes these abstract features and regresses them against the solar wind speed. The proposed model WindNet is a deep learning model constructed from a ConvNet and an RNN. Here, a pretrained GoogLeNet (Szegedy et al., 2015) is used as the feature extractor; GoogLeNet was trained on the ImageNet dataset of everyday images with the objective of classifying them into different categories. As mentioned in Sec. 1, ConvNets work by detecting patterns at multiple scales in the input, where the scale over which patterns are detected is set by the size of the convolution kernel. ConvNets generally perform convolutions sequentially; thus, at a given layer, the sensitivity is only to a particular scale. GoogLeNet, however, introduced the Inception module (Szegedy et al., 2015): at each layer, this module performs convolutions with different kernel sizes in parallel, providing sensitivity at multiple scales at the same time. This has been shown to outperform other models on the ImageNet14 dataset (Szegedy et al., 2015). Given the large volume of training data in ImageNet14, the initial layers of the network capture generic global features in the images (Goodfellow et al., 2016). As one goes deeper, the network captures features specific to the training dataset, which are not relevant to our data. Thus, we use this pretrained network to generate an embedding corresponding to the AIA data, a technique known in the literature as transfer learning (Yosinski et al., 2014). We adopt a 'multi-resolution approach' to generate the embeddings, i.e., responses from layers at different depths are taken, normalized and concatenated. The embeddings are then fed to an LSTM for regression against the SW speed.
GoogLeNet has its weights fixed, while the LSTM (and a fully connected layer at the end) are trained. We use a single LSTM cell in our work. The model is developed using the Tensorflow package for Python (Abadi et al., 2015), and has been summarized in Fig. 6.
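The recurrence inside the trainable LSTM cell can be illustrated with a minimal numpy forward pass (a sketch of the standard LSTM equations; the paper uses TensorFlow's implementation, and the parameter layout here is our own convention):

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One step of a single LSTM cell: gates are computed from the current
    embedding x and previous state (h, c). W (4n x d), U (4n x n) and
    b (4n) stack the input, forget, output and candidate parameters."""
    z = W @ x + U @ h + b
    n = len(h)
    i, f, o = (1.0 / (1.0 + np.exp(-z[k * n:(k + 1) * n])) for k in range(3))
    g = np.tanh(z[3 * n:])     # candidate cell update
    c = f * c + i * g          # forget old state, admit new information
    h = o * np.tanh(c)         # new hidden state, fed to the output layer
    return h, c
```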
The training details for the algorithm are summarized in Table. 3.

Activation Visualization
There exist techniques in the DL literature to visualize which neurons in the hidden layers are preferentially activated for a given input; this activation can be propagated back to the input to understand which regions of the input data have a large impact on the prediction. These methods rely primarily on the gradient of the output with respect to each input pixel, thereby approximating the regions most responsible for an increase or decrease in the output. The methods, while not perfect visualizers, offer a window into the workings of the network. In this work, we use Grad-CAM (Selvaraju et al., 2017) to obtain the activation maps. The maps are averaged across channels, and then scaled up to the dimensions of the input image for comparison. These activation maps are subsequently used to generate a metric for determining the influence of the CHs and ARs.
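The core Grad-CAM combination step can be sketched as follows (a numpy sketch that assumes the layer activations and output gradients have already been extracted from the network):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM combination step: weight each feature map by its
    spatially-averaged gradient, sum over channels, and apply ReLU.
    activations, gradients: (H, W, channels) arrays from a convolutional
    layer and the gradient of the output w.r.t. that layer."""
    weights = gradients.mean(axis=(0, 1))            # global-average-pool the gradients
    cam = np.tensordot(activations, weights, axes=([2], [0]))
    return np.maximum(cam, 0.0)                      # ReLU keeps positive influence only
```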

Generating binary masks
A simple metric for understanding the influence of a particular set of features on a regression problem is the mean value of the activation over those features across all datapoints, and its variation over the days leading up to prediction. It is therefore important to segment out the CHs and the ARs to generate binary maps. To obtain the CH segmentation map, we use Otsu thresholding (Otsu, 1979). This thresholding assumes the presence of two distinct, approximately Gaussian classes of pixel intensities, and finds the intensity value that maximizes the inter-class variance (or, equivalently, minimizes the intra-class variance). We use stacked thresholding, i.e., a preliminary threshold to segment out the approximate region of the coronal holes, and then another threshold to segment out the coronal holes from this subset of the image.
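Otsu's criterion can be implemented in a few lines of numpy (an illustrative sketch; applying it twice gives the stacked thresholding described above):

```python
import numpy as np

def otsu_threshold(img, nbins=256):
    """Otsu's method: choose the intensity that maximizes the between-class
    variance of the two pixel populations on either side of the threshold."""
    hist, edges = np.histogram(img.ravel(), bins=nbins)
    p = hist / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(p)                  # class-0 probability at each threshold
    m = np.cumsum(p * centers)         # cumulative mean
    mt = m[-1]                         # global mean
    with np.errstate(divide="ignore", invalid="ignore"):
        var_between = (mt * w0 - m) ** 2 / (w0 * (1.0 - w0))
    return centers[np.nanargmax(var_between)]
```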
The AR segmentation is less straightforward: Otsu thresholding picks out spurious areas as 'active regions'. Hence, we fit a 5-class Gaussian Mixture Model (Pedregosa et al., 2011) to the pixel intensities to segment out the ARs; the Gaussian with the highest mean is found to segment out the ARs well. A representative set of segmentation maps overplotted on the EUV data is shown in Fig. 7.
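The AR mask extraction can be sketched with scikit-learn's `GaussianMixture` (an illustrative sketch; the function name is ours, and the number of components follows the text):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def ar_mask(img, n_components=5, seed=0):
    """Fit a Gaussian mixture to the pixel intensities and keep the pixels
    assigned to the highest-mean component as the active-region mask."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    labels = gmm.fit_predict(img.reshape(-1, 1))
    brightest = np.argmax(gmm.means_.ravel())  # highest-mean Gaussian = ARs
    return (labels == brightest).reshape(img.shape)
```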
With these binary maps, we perform a pointwise multiplication of the activation values for a given image with its CH and AR maps respectively, while also scaling by the total segmented area. The scaling by CH and AR area is done to remove the dependence on the absolute size of these regions and obtain a normalized quantity. We then take the mean value over the image, and across all datapoints, to obtain a single scalar quantifying the activation at ARs and CHs across the days of history, for both fast and slow SW. The activation plots are constructed for the training set (for better statistics), since the generalizability of the model is captured in its performance on the test set (or the cross-validation set).
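The per-image, area-normalized activation described above reduces to a one-line numpy computation (an illustrative sketch):

```python
import numpy as np

def mean_activation(act_map, mask):
    """Mean activation over a segmented region (CH or AR), normalized by
    the region's area so the value is independent of the region's
    absolute size. Returns 0 for an empty mask."""
    area = mask.sum()
    return float((act_map * mask).sum() / area) if area else 0.0
```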

Model benchmarking
In Tables 4 through 7 we summarize the performance of WindNet, as well as the benchmark autoregressive models, for the metrics defined: correlation (r), RMSE, $\chi^2_{red}$ and TS, respectively. We see that WindNet outperforms the benchmarks over combinations where the delay is generally more than 1, i.e., where the autoregressive models do not have the immediately preceding solar wind speed available. In fact, for larger delays and histories, WindNet shows consistent performance, while the other models fail to give a reasonable prediction. The best performance of WindNet is for a history-delay combination of (4, 3) on 211Å data, with a correlation of ≈ 0.55 and a spread of 0.03. Similarly, the best fit using 193Å data occurs for a combination of (2, 4), with a correlation of 0.51 and a spread of 0.03. The naive mean model has no variance, so there is no correlation associated with it; it is nevertheless presented for the sake of completeness. Autoregressive SVM with an RBF kernel performs better when given the SW speed closer to the day of prediction, but falters as more delay is introduced. The linear SVM performs as well as the non-linear RBF kernel, but the polynomial kernel fails to obtain a good fit. The 27-day persistence is a set of just 5 models; its performance is therefore stated in the captions of the respective tables.

WindNet prediction
In this section we investigate the variation in prediction across our WindNet models. The model with the highest correlation, as mentioned previously, has a history of 4 and a delay of 3 for 211Å. As can be seen in Table 4, there is a subtle trend of increasing correlation with history for a given short delay. The predictions for models with delay smaller than history are mostly consistent within the error bars. For the 193Å models in Table 4, an increase in delay for a given history results in an almost constant prediction correlation for high-history models (again, within the error bars, though the mean values do not follow an ordered trend), except in the case of 1-day history, where the correlation increases. This trend with increasing delay for a given history is largely followed in the 211Å data, though the 4-day history is the most consistent within the errors in this case, and the best performing. In general, the expectation would be an increase in correlation with increasing history, and some form of variation due to an increase in delay. The variation in performance with history for small delays is fairly consistent between 193Å and 211Å, with only the actual correlation values differing; however, larger-delay models do not show the same variation in performance for 193Å and 211Å. 211Å, in fact, seems to be the better channel for SW prediction, since the corresponding models have higher correlation means and smaller standard deviations. Short-delay, short-history models (for example, 1-day history and delay) do not perform as well as models with larger history and delay (for example, 4-day history and 3-day delay), since the solar wind is yet to arrive at L1. The 193Å data show a peak in correlation at 2-day history and 4-day delay; the 211Å data show a similar peak at 4-day history and 3-day delay.
A summary of the RMSE is shown in Table 5, and a similar summary of $\chi^2_{red}$ in Table 6. The TS is tabulated in Table 7, which shows that our proposed model reaches a maximum of 0.357±0.03. The low TS is better explained by a careful inspection of Fig. 8, a plot of one of the cross-validation models using 211Å with the highest correlation, using 4 days of history and 3 days of delay. With 10 TP, 2 FP, and 13 FN, this model has a TS of 0.4, a correlation with observation of 0.61, an RMSE of 76.4 km/s and a $\chi^2_{red}$ of 19.35. Many HSEs present in the observed wind speed appear to be missing from the prediction. However, upon careful inspection, many of the observed HSEs do correspond to an enhancement in the predicted wind speed, either at the exact time step or with a lag/lead of 3 to 4 days. The predicted values, though, do not show enhancements exceeding the prescribed thresholds, so these events are not marked as HSEs. WindNet's performance on the error metrics largely complements its correlation performance, and shows that WindNet performs better than the benchmark models for delays larger than 1 day in most cases.

Activation visualization
Using Grad-CAM activation maps (described in Sec. 3.3.1), we analyze the activation of our models for the various days of input data. Fig. 9 shows a sample Grad-CAM map from a fast and a slow wind prediction using 211Å data; for comparison, a similar map from a 193Å prediction is shown in Fig. 10. We see that the CHs are activated for the prediction of the fast wind, while the ARs are predominantly activated for the slow wind prediction. The CH peak activation for fast wind occurs 3 to 4 days prior to prediction, which seems to corroborate the correlations obtained independently (Vršnak et al., 2007). The slow wind activation peaks at the AR close to the day of prediction (and also at the earliest day prior to prediction for the 211Å data), with activation at other regions of the Sun further away from prediction. We hypothesize that this might be due to a bias of the LSTM toward the most recent input to the network, but this remains a hypothesis.
To understand the statistics of activation assigned to CHs and ARs, we compute the mean activation value (as described previously) and plot it for 'fast-wind' and 'slow-wind' predictions from the model. While each cross-validation model set has its own activation plot, we present the plots for the 193Å and 211Å models using a 4-day history of data with 3 days of delay. We also plot the activation for the models using a 4-day history with 2 days of delay, since these show consistent (and good) performance with both 193Å and 211Å data. These trends are shown in Fig. 11 and Fig. 12 respectively.
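The mean activation statistic described above can be sketched as a simple masked average over a Grad-CAM map. The function name and the toy arrays below are our own illustrations, not taken from the WindNet codebase:

```python
import numpy as np

def mean_activation(gradcam_map, region_mask):
    """Mean Grad-CAM activation within a region (CH or AR) mask.

    gradcam_map: 2-D array of activations rescaled to [0, 1].
    region_mask: boolean 2-D array marking the CH or AR pixels.
    Returns 0.0 if the mask is empty (no such region on that day).
    """
    if region_mask.sum() == 0:
        return 0.0
    return float(gradcam_map[region_mask].mean())

# Toy example: activation concentrated entirely inside the masked region.
cam = np.zeros((8, 8))
cam[2:4, 2:4] = 1.0
mask = np.zeros((8, 8), dtype=bool)
mask[2:4, 2:4] = True
print(mean_activation(cam, mask))  # -> 1.0
```

Averaging per masked pixel (rather than summing) makes the statistic comparable between coronal holes and active regions of different sizes, i.e., an activation per unit area.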
manuscript submitted to Space Weather

Figure 9. GC activation maps for a fast (top) and slow wind (bottom) prediction using 211Å data, with the colormap corresponding to each row given on the right. The activation maps and the images have been rescaled between 0 and 1 row-wise for ease of comparison. For the fast wind prediction, note how the maximum activation occurs at the CH, 3 to 4 days prior to prediction, which seems to match the correlations obtained in the literature (Vršnak et al., 2007). The slow wind, on the other hand, activates the AR closer to the prediction, with no activation at the CH.
It can be seen from these plots that the fast SW induces greater activation at the CHs closer to the day of prediction, and the activation at the CHs decreases as we go farther into the past; for the 211Å model, however, the activation shows a slight increase. The fast wind also seems to activate the ARs at much earlier times, for both 193Å and 211Å. Note, however, that the peak CH activation is larger than the peak AR activation for 193Å; for the 211Å data, the two are consistent within the errors in Fig. 11. For the same parameters in Fig. 12, the CH activation peaks 3 days prior to prediction for both channels, and then goes down. Interestingly, however, the AR also seems to be activated to a similar level, but much further away from prediction time. For the slow wind, the activation at the ARs remains high for a much longer duration than at the CHs; however, the peak occurs closer to the day of prediction, rather than further away from it. This trend is seen in both Fig. 11 and Fig. 12.

Figure 10. GC activation maps for a fast (top) and slow wind (bottom) prediction using 193Å data, with the colormap corresponding to each row given on the right. The activation maps and the images have been rescaled between 0 and 1 row-wise for ease of comparison. For the fast wind prediction, note how the maximum activation occurs at the CH, 3 to 4 days prior to prediction, which seems to match the correlations obtained in the literature (Vršnak et al., 2007). The slow wind, on the other hand, activates the AR both closer to and further away from prediction, and activates the small CH on the day closest to prediction. However, other regions of the quiet Sun show higher activation further away from the day of prediction. The slow wind activation is quite mixed and unclear when compared with the fast wind activation.

Discussion
The problem of SW prediction can be approached in two ways: one, through purely theoretical modelling of the mechanism and fine-tuning parameters to fit observations; and two, through purely empirical modelling and attempting to extract the physics. We propose WindNet to empirically model the SW speed using AIA imagery alone. We are able to predict the SW speed better with the 211Å data, obtaining a correlation of 0.55 with the observed wind speed in cross validation. The best performing models using 193Å and 211Å both outperform most of the larger-delay benchmark models, as well as the 27-day persistence model. The $\chi^2_{red}$, which accounts for uncertainty in the measurement itself, indicates that our best models outperform the 27-day persistence model, and are only slightly worse than an autoregressive model with a single day of delay, more so for lead-time predictions of 3-4 days. The activation plots suggest the trained WindNet model pays attention to certain solar features consistent with heuristic expectations from solar wind theory, e.g., the peak in activation at the CH 3-4 days prior to prediction. However, the significance of any interpretation of the activation values depends on a few other factors:
• Fitting error of our model: we still have a maximum correlation of only 0.51 (0.55) for the 193Å (211Å) data.
• Visualization: the Grad-CAM used in this work gives a very coarse localization of activation, and thus may not point to the precise origin of a particular kind of wind.
• Segmentation: accurately defining a region as a CH is difficult with intensity values alone; ideally, one would require extrapolated magnetic field lines to check for CHs. Thus, an accurate intensity-based definition of CHs is required. In this work, we attempted to automate both the CH and AR definitions using histogram analysis, so there is bound to be some uncertainty. Hence, better segmentation methods may capture the entire activation within a CH or an AR, and give a much better estimate of activation per unit area.
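As an illustration of the histogram-based definitions discussed above, a minimal intensity-percentile segmentation might look as follows. The percentile thresholds and function name here are placeholders for illustration, not the values or code used in this work:

```python
import numpy as np

def segment_ch_ar(img, low_pct=5, high_pct=99):
    """Rough intensity-based segmentation of a (preprocessed) EUV image.

    Dark pixels below the low_pct intensity percentile are flagged as
    candidate coronal holes (CH); bright pixels above the high_pct
    percentile as candidate active regions (AR). Real pipelines would
    refine this, e.g. with morphological cleaning or magnetic field
    information, since intensity alone is ambiguous.
    """
    lo, hi = np.percentile(img, [low_pct, high_pct])
    ch_mask = img < lo
    ar_mask = img > hi
    return ch_mask, ar_mask

# Toy image with a uniform spread of intensities.
img = np.arange(100.0).reshape(10, 10)
ch, ar = segment_ch_ar(img)
print(int(ch.sum()), int(ar.sum()))  # -> 5 1
```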
At first glance, WindNet may appear not to outperform existing models in terms of the metrics used. However, comparing our model to existing models (like the regressive models of Rotter et al., 2015, or Wang & Sheeley Jr, 1990) would be an apples-to-oranges comparison, since:
• We perform predictions over multiple Carrington rotations on the whole 8-year dataset.
• Our prediction target is the daily averaged SW speed, and as such must be compared to daily averaged predictions by other models.
• We perform 5-fold cross validation on this dataset. However, due to a lack of confidence intervals in the previously reported results, we are unable to check whether our results are statistically different from those of existing models.
• Our model is built entirely using open-source software, and the codebase may be used by the community at large to improve upon the results.
Thus, any benchmarking of our model must be done with models undergoing the same data preparation procedure, the same span of data and at the same cadence. We thus do not compare our results with the existing aforementioned models.
To overcome this limitation, we propose empirical benchmark models not unlike the existing empirical solar wind prediction models. In this regard, WindNet shows reasonable performance vis-a-vis the benchmark models -however, there are numerous improvements possible.
• Data preparation: as H+D increases, more samples are discarded (as explained in Sec. 2.4). This may be made more efficient by performing the cross-validation (CV) assignment first and splitting into folds later, with the downside of high memory consumption.
• ICME mitigation: our random assignment in the 5-fold cross validation is meant to ensure that the ICMEs are distributed uniformly across all the folds, thereby influencing all the CVs equally. Due to an inadequate number of ICME samples, we do not characterise them further.
• Network architecture: better architectures may be designed to improve the prediction vis-a-vis the observations, or newer ML methods may be employed for direct prediction.
• Visualization: visualization of ML models is a hot area of research in the ML community; more accurate visualization techniques may be expected to emerge in the coming years.
• TS evaluation: as seen in Sec. 4, the HSE capturing algorithm misses many potential enhancements because the speed increases do not satisfy the absolute speed-change criterion. Hence, the TS evaluation should be taken with caution.
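The data-preparation point above can be illustrated with a toy windowing routine showing how the number of usable samples shrinks as history plus delay grows: each sample needs `history` consecutive image days plus a wind-speed target `delay` days after the last image, and windows whose target falls outside the segment are discarded. The function name and 10-day toy segment are our own illustration, not the paper's pipeline:

```python
def make_samples(image_days, wind_days, history, delay):
    """Pair `history` consecutive days of images with the wind speed
    `delay` days after the last image day. Windows whose target index
    falls outside the segment are discarded, so larger history + delay
    means fewer usable samples per contiguous data segment."""
    samples = []
    for start in range(len(image_days)):
        end = start + history           # images cover [start, end)
        target = end - 1 + delay        # wind-speed index to predict
        if target < len(wind_days):
            samples.append((image_days[start:end], wind_days[target]))
        # else: window discarded, target lies beyond the segment
    return samples

# For a 10-day segment: history=1, delay=1 keeps 9 samples;
# history=4, delay=3 keeps only 4.
days = list(range(10))
print(len(make_samples(days, days, 1, 1)),
      len(make_samples(days, days, 4, 3)))  # -> 9 4
```

Splitting the raw time series into folds first and windowing within each fold avoids one window straddling a fold boundary, at the cost of discarding the edge samples of every fold.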
This work is a first step toward training and testing various ML models for predicting other SW target parameters, such as proton density, temperature and magnetic field (specifically, $B_z$). The code and data used in this work are open-sourced (the model may be found on GitHub: https://github.com/Vishal-Upendran/WindNet). The publicly released source code supports reproducible research, covering data partitioning, cross validation, model training, and evaluation, and can be built upon by other researchers to further improve the performance of solar wind prediction models. Furthermore, with the ever-increasing research on interpretable AI, this codebase may be used to develop various visualization methods to quantify the regions of solar wind emergence.