GeoDGP: One-Hour Ahead Global Probabilistic Geomagnetic Perturbation Forecasting Using Deep Gaussian Process
Abstract
Accurately predicting the horizontal component of ground magnetic field perturbation (d), a key quantity for calculating the geomagnetically induced currents (GICs), is crucial for assessing the space weather impact of geomagnetic disturbances. The current operational first-principles Michigan Geospace model provides effective forecasts of d, but requires significant computational resources to achieve real-time speeds. Existing data-driven methods tend to underpredict d and lack uncertainty quantification, which is either overlooked or treated as secondary. In this work, we introduce GeoDGP, a novel and efficient data-driven model based on the deep Gaussian process. GeoDGP provides global probabilistic forecasts of d with a lead time of at least 1 hr, at 1-min time cadence, and at arbitrary spatial locations. The model takes solar wind measurements, the Dst index, and the prediction location in the solar magnetic coordinate system as inputs, and is trained on 28 years of data from SuperMAG global magnetometer stations. GeoDGP is also trained to predict the north and east components of the perturbations. We evaluate GeoDGP's performance at over 200 stations worldwide during 24 geomagnetic storms, including the Gannon extreme storm of May 2024. Comparisons with the first-principles Michigan Geospace model and the data-driven DAGGER model reveal that GeoDGP significantly outperforms both across multiple performance metrics.
Key Points
- GeoDGP is a data-driven model that provides global probabilistic geomagnetic perturbation forecasts at 1-min cadence and 1 hr ahead
- We evaluate GeoDGP on a wide range of geomagnetic storms and at over 200 magnetometer stations across the globe
- GeoDGP outperforms both the state-of-the-art first-principles Geospace model and the data-driven DAGGER model across multiple metrics
Plain Language Summary
Ground magnetic field perturbations are crucial for predicting geomagnetically induced currents, which can harm power grids, communication systems, and other ground infrastructure. Accurately predicting these perturbations with high temporal and spatial resolution is critical but remains a significant challenge in space weather forecasting. In this work, we introduce GeoDGP, an advanced data-driven model that provides reliable forecasts of these perturbations at least 1 hr ahead, at 1 min intervals, and can be used for any location. GeoDGP's performance is evaluated across a wide range of geomagnetic storms using data from over 200 magnetometer stations worldwide. The results show that GeoDGP significantly outperforms leading existing first-principle and data-driven models.
1 Introduction
Geomagnetically induced currents (GICs), driven by geomagnetic storms, can significantly impact modern technological systems such as power grids, natural gas pipelines, and telecommunication systems (Boteler, 2001; Eastwood et al., 2018; Pirjola et al., 2000; Pulkkinen et al., 2001). Predicting GICs directly for the full surface of Earth is challenging, primarily due to the limited availability of GIC data. This limitation arises from the proprietary nature of data sets and the spatial-temporal sparsity of measurements. As GICs are driven by the geoelectric field at Earth's surface, they can also be obtained from the horizontal component of ground magnetic field perturbations d, and the ground conductivity (Huang et al., 2004).
Currently, the NOAA Space Weather Prediction Center uses the Michigan Geospace model (referred to as the Geospace model hereafter), a component of the Space Weather Modeling Framework (SWMF; Gombosi et al., 2021; Tóth et al., 2012), to produce ground magnetic field perturbation forecast maps. The first-principles Geospace model solves partial differential equations on various grids, providing high-fidelity and physics-justified predictions but is generally computationally expensive. Its prediction performance of d during storm time was comprehensively studied by Al Shidi et al. (2022), where simulations were conducted for 122 geomagnetic storms from 2010 to 2019 and evaluated at over 300 magnetometer stations around the world. While the model achieved a median Heidke Skill Score (HSS) of 0.45 using a threshold of 50 nT for magnetometers in all latitude regions, the prediction of high-latitude regional disturbances remained challenging.
Several data-driven models have also been developed to predict d and/or, more challengingly, the horizontal component of the time derivative of ground magnetic field perturbations (Viljanen et al., 2001), following the Geospace Environment Modeling (GEM) challenge that ran from 2008 to 2012 (Pulkkinen et al., 2013). These models are trained on available historic storm data to directly learn the mapping from solar wind and geophysical features to the quantity of interest, bypassing the complexities of explicitly modeling the governing physics. A key advantage of data-driven models is that they are computationally inexpensive to evaluate once the training is finished. As an example, Pinto et al. (2022) experimented with a feed-forward artificial neural network (ANN), a long short-term memory (LSTM) recurrent neural network, and a convolutional neural network (CNN) and compared the performance of these models in predicting the time derivative across six ground magnetometer stations studied in the GEM challenge. Coughlan et al. (2023) explored categorical forecasts to mitigate the heavy-tailed distribution of the time derivative, using a CNN to predict whether it would exceed its 99th percentile threshold 30–60 min in the future. These studies have shown that directly predicting the time derivative from solar wind features is rather challenging due to its highly variable and chaotic nature. As a result, model performance is often evaluated using threshold-based metrics such as skill scores (Pulkkinen et al., 2013).
On the other hand, d is more predictable than its time derivative (Kellinsalmi et al., 2022; Tóth et al., 2014), and it shows a strong correlation with the time derivative that can be approximated by a power law (Tóth et al., 2014). Keesee et al. (2020) and Blandin et al. (2022) used ANN and/or LSTM models to indirectly predict the time derivative (or the time derivative of the magnitude of the northward component) by training models to predict d (or the northward component). However, these approaches did not consistently outperform benchmark models in predicting the time derivative. In this work, we show that certain GIC-related risks, such as transformer heating, can be quantified directly from d without relying on its time derivative, in a manner similar to Hu et al. (2024). We also show that significant improvement in d prediction can be achieved. Prior work by Iong et al. (2024) developed a Gaussian process (GP) model incorporating contaminated Gaussian noise (Gleason, 1993) to predict maxima over 20-min temporal bins with uncertainty quantification (UQ) at selected stations. These aforementioned approaches mainly focus on a single-station modeling strategy and do not consider or exploit the spatial-temporal correlations between different stations. Upendran et al. (2022) expanded the scope to global prediction, proposing the DAGGER model that couples spherical harmonics with deep learning to generate d predictions for mid- and high-latitude regions in the entire Northern Hemisphere. While these models are able to predict the timing of magnetic perturbations well, they tend to consistently underestimate d during intense geomagnetic storms (Camporeale et al., 2020), where d can still be highly variable, with peak values exceeding 1,000 nT. Furthermore, neither the Geospace model nor the DAGGER model systematically integrates predictive uncertainty estimation for global forecasting, which would be valuable for understanding model reliability and supporting operational decision-making.
We develop a new data-driven model for global d predictions, which we call GeoDGP, based on the deep Gaussian process (DGP) (Damianou & Lawrence, 2013), a Bayesian modeling approach. Specifically, GeoDGP provides global probabilistic forecasts of d at 1-min time cadence and at arbitrary spatial locations. The model takes as input the queried location as well as storm features, which include solar wind and interplanetary magnetic field (IMF) measurements, solar radio flux measurements (the F10.7 index), and the disturbance storm time (Dst) index. A distinctive and important idea of GeoDGP is to encode the location of stations relative to the Sun as an input feature, using the solar magnetic (SM) coordinate system. This idea leverages our physical understanding that magnetic perturbation primarily depends on the location in SM coordinates instead of the geographic location (Kivelson & Russell, 1995). As a result, even with the same data set, magnetometer stations effectively provide information over fixed magnetic latitude circles but all magnetic local times as the Earth rotates, instead of single geographical points. This improves coverage over the surface of Earth and enhances spatial interpolation to different locations, potentially improving the model's forecasting capabilities, though it may neglect local ground conductivity and geological effects.
We train GeoDGP to operate with two different lead times. The first lead time is the time it takes for the solar wind and IMF observed near the first Lagrange point (L1) to reach Earth, which varies from about 30 min at high speeds of 800 km/s to around 1 hr at typical speeds of 400 km/s. The second lead time is this propagation time plus 1 hr. While this longer lead time poses greater challenges due to the difficulty in predicting future solar wind and IMF, it significantly enhances the model's utility for practical applications, offering additional warning time. Moreover, nightside ionospheric current systems, including field-aligned currents (FACs), which are strongly correlated with ground magnetic field perturbations, can respond to solar wind driving with a time lag on the order of an hour (Coxon et al., 2019).
GeoDGP is evaluated on three test data sets consisting of over 200 magnetometer stations and a total of 24 geomagnetic storms, including the recent Gannon extreme storm in May 2024. Results show that GeoDGP predicts perturbations with magnitudes comparable to observations, even during the extreme storm. Furthermore, it consistently outperforms both the Geospace and DAGGER models across multiple metrics, while also capturing spatial heterogeneity in its predictive uncertainty estimation. Notably, the same model architecture can be used to predict the northward and eastward components, which are important for calculating the geoelectric field and the resulting GICs. However, while the model supports predictions at arbitrary spatial locations, its effective spatial resolution is constrained by the smoothness imposed by the learned functions.
The remainder of this paper is organized as follows. Section 2 describes the data used in this study and the motivation for global modeling. Section 3 provides details of the GeoDGP model and the model architecture. Section 4 contains key results on model performance and further discussions. Section 5 presents conclusions, limitations, and ideas for future work.
2 Data Cleaning and Preprocessing
In this section, first we show the correlation and underlying clustering structure of magnetometer stations to motivate the global modeling. Then, we introduce the problem setup and describe the model input and output. Finally, we introduce ground magnetic field perturbations data used in this paper and the data cleaning procedure.
2.1 Station Correlation and Clustering
The matrix is then used as input for clustering analysis to group stations based on their similarities. In particular, we use hierarchical agglomerative clustering with complete linkage (Hastie et al., 2009), which progressively merges stations into clusters based on their proximity until a pre-specified distance threshold is met. Using a distance threshold of 1.15, we identify 8 disjoint clusters, as shown in Figure 1, where circles represent stations and their colors indicate cluster assignment. We observe that station clusters mostly align with the magnetic latitude, and correlations are noted between stations in the Northern and Southern Hemispheres that share similar magnetic latitudes (in absolute value). Additionally, the denser distribution of stations in the Northern Hemisphere makes a longitudinal structure visible in the clustering results. These structures may reflect underlying magnetospheric processes that are dependent on magnetic local time (MLT), such as the distribution of FACs and auroral electrojets. Moreover, longitudinal variations in ionospheric conductivity, which can be driven by solar illumination, may also affect the ground magnetic perturbations and contribute to the observed spatial patterns. These results suggest that modeling ground magnetic perturbations jointly using data from all stations can be more beneficial than a single-station training strategy.

Figure 1. Hierarchical clustering of magnetometer stations based on the correlation of d observations during the Gannon extreme storm. The distribution of station clusters is well aligned with the magnetic latitude grid lines (black and solid).
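As a rough illustration of the clustering step, the following is a minimal sketch of agglomerative clustering with complete linkage that stops at a distance threshold. The function name `complete_linkage_clusters` and the use of one minus the correlation as the distance in the example are assumptions for illustration; the distance definition actually used in this study may differ, and a library routine would normally be preferred over this plain-Python version.

```python
from itertools import combinations


def complete_linkage_clusters(dist, threshold):
    """Agglomerative clustering with complete linkage.

    dist: symmetric matrix (list of lists) of pairwise distances.
    Merging stops once the closest pair of clusters is farther apart
    than `threshold`. Returns sorted lists of member indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest complete-linkage distance.
        (p, q), d = min(
            (((p, q), linkage(clusters[p], clusters[q]))
             for p, q in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > threshold:
            break
        # p < q always holds, so deleting q leaves index p valid.
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return sorted(sorted(c) for c in clusters)


# Hypothetical correlation matrix for four stations.
corr = [
    [1.0, 0.9, -0.5, -0.4],
    [0.9, 1.0, -0.6, -0.3],
    [-0.5, -0.6, 1.0, 0.8],
    [-0.4, -0.3, 0.8, 1.0],
]
# Assumed distance: one minus the correlation (ranges from 0 to 2).
dist = [[1.0 - c for c in row] for row in corr]
clusters = complete_linkage_clusters(dist, threshold=1.15)
```

In this toy example the two strongly correlated pairs are merged, while the two resulting groups stay separate because their complete-linkage distance exceeds the threshold.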
2.2 Problem Setup
We frame the prediction of ground magnetic field perturbations as a regression task, a supervised learning problem where a model is trained on labeled data to learn the relationship between input features and real-valued outputs. In this work, we train separate models for two lead times: the time it takes for the solar wind and IMF observed at L1 to reach Earth, and that time plus 1 hr. In each case, the model takes input at a given time and outputs the ground magnetic perturbations one lead time later. The output is produced at 1-min temporal resolution and can be made at arbitrary spatial locations. We train separate models for predicting the north component, the east component, and d.
2.3 Model Input
We group the model input features of GeoDGP into two sets: storm feature input and location input.
The storm feature input captures the drivers of geomagnetic storms. This input set includes the real-time Dst and F10.7 indices, along with solar wind and IMF measurements. The Dst index can be obtained from the Kyoto Dst index service, and the other measurements can be sourced from the NASA/GSFC's OMNI data set (Papitashvili & King, 2020). Specifically, we include the x-component of the solar wind velocity in Geocentric Solar Magnetospheric (GSM) coordinates, the proton number density, the plasma temperature, the IMF in GSM coordinates, and the solar radio flux at 10.7 cm (F10.7 index). Note that the OMNI data are already time-shifted to the magnetosphere's bow shock nose (King & Papitashvili, 2005) from their raw measurement at L1 to better align with ground measurements on Earth. Additionally, we augment these features with their history before feeding them to the model. The augmentation is done separately for each measurement due to their different time cadences. Based on empirical experiments, we augment the Dst index with 12 hr of history at hourly intervals, use the F10.7 index as is without any history due to its daily cadence, and augment solar wind and IMF measurements with 1 hr of history using 5-min medians based on their 1-min cadence. This input augmentation helps account for uncertainty in the time lag between upstream measurements and ground magnetic responses, while the use of 5-min medians for temporal smoothing reduces input noise and keeps the dimensionality manageable for downstream model training. For all measurements, missing values are imputed based on the last available values if the gap is within 15 min; otherwise, we remove the gap interval from the data set. This 15-min threshold is chosen based on the trade-off between maximizing data availability and maintaining data validity (Smith et al., 2022).
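The imputation and history augmentation described above can be sketched as follows. This is a minimal illustration, not the preprocessing code used in this work: the function names `forward_fill` and `history_medians`, the representation of missing values as NaN, and the exact alignment of the 5-min blocks are all assumptions.

```python
import math
from statistics import median


def forward_fill(values, max_gap=15):
    """Fill runs of missing 1-min samples (NaN) with the last valid value.

    Runs longer than `max_gap` samples, and runs at the very start or end
    of the series, are left as NaN so the caller can drop those intervals.
    """
    filled = list(values)
    last_valid = None   # most recent observed value
    run_start = None    # index where the current NaN run began
    for i, v in enumerate(values):
        if math.isnan(v):
            if run_start is None:
                run_start = i
            continue
        if run_start is not None and last_valid is not None:
            if i - run_start <= max_gap:
                for j in range(run_start, i):
                    filled[j] = last_valid
        run_start = None
        last_valid = v
    return filled


def history_medians(series, t, history=60, width=5):
    """Medians of consecutive `width`-min blocks covering the `history`
    minutes that end at index t (requires t >= history - 1)."""
    start = t - history + 1
    return [median(series[s:s + width]) for s in range(start, t + 1, width)]


nan = float("nan")
filled = forward_fill([1.0, nan, nan, 2.0], max_gap=15)
features = history_medians(list(range(60)), t=59)  # 12 values for 1 hr of history
```

With a 1-hr history and 5-min blocks, each solar wind or IMF variable contributes 12 values to the input vector under this assumed alignment.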
The location input contains the dipole tilt angle (the time-dependent angle between the Earth dipole and the Sun-Earth line) and the time-dependent magnetometer station locations (which reflect the positions of magnetometer stations relative to the magnetic dipole of Earth and the direction of the Sun). The inclusion of the dipole tilt angle and the IMF components in GSM coordinates allows the model to capture seasonal effects, such as the Russell–McPherron effect (Russell & McPherron, 1973), provided sufficient training data. The station locations are given as the magnetic longitude (similar to the MLT) and magnetic latitude of each station in the SM coordinate system. The SM coordinates rotate with both a yearly and daily periodicity with respect to inertial coordinates. The Y-axis is chosen perpendicular to the Earth-Sun line pointing towards dusk, the Z-axis is parallel to the magnetic axis pointing towards the geographic north, and the X-axis completes the right-handed coordinate system such that the Sun is in the half plane determined by the X and Z axes (Russell, 1971). The 0-longitude in SM coordinates contains the subsolar point, and the longitude increases counterclockwise when viewed from the north magnetic pole. Therefore, this setup naturally incorporates periodicity and seasonality into the model input. Additionally, rather than being a fixed geographical point, a station provides information along a curve of fixed magnetic latitude that covers all magnetic local times due to the rotation of Earth. This allows data samples to share the same location input even when observed from stations at different geographic locations at different times. As a result, this approach is conceptually different from training a separate model for each individual station. One challenge with such a modeling strategy is that discontinuities in longitude can lead to discontinuities in model predictions. To address this, we use trigonometric functions to transform the longitude before feeding it to the model to ensure continuity.
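The effect of the trigonometric transform can be seen in a small sketch. The function name `encode_longitude` and the use of degrees as the input unit are assumptions for illustration.

```python
import math


def encode_longitude(lon_deg):
    """Map a longitude in degrees to (cos, sin), which is continuous
    across the 0/360 degree wrap-around."""
    lon = math.radians(lon_deg)
    return math.cos(lon), math.sin(lon)


# Two longitudes that are 0.2 degrees apart but on opposite sides of the wrap.
a = encode_longitude(359.9)
b = encode_longitude(0.1)
gap = math.dist(a, b)  # Euclidean distance in the encoded space
```

In the raw longitude the two points differ by 359.8, whereas in the encoded space they are nearly identical, so a model with smooth kernels sees them as neighbors.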
The final model input set to GeoDGP is summarized in Table 1.
| Symbol | Description |
|---|---|
| | x-component of solar wind velocity in GSM coordinates |
| | Proton number density |
| | Plasma temperature |
| | IMF components in GSM coordinates |
| Dst | Disturbance storm time index |
| F10.7 | Solar radio flux at 10.7 cm |
| | Dipole tilt angle |
| | Geomagnetic latitude |
| | Cosine of the geomagnetic longitude in SM coordinates |
| | Sine of the geomagnetic longitude in SM coordinates |
During operational forecasting, only the storm feature input needs to be provided in real time, as time-dependent location input can be precomputed. Leveraging the statistical properties of DGP (Damianou & Lawrence, 2013), GeoDGP allows predictions to be made at any location on the surface of Earth, without being restricted to the fixed locations of magnetometer stations.
2.4 Model Output
The model output variables of GeoDGP are the ground magnetic perturbation components (the north component, the east component, and d). The measurement data of ground magnetic perturbations can be downloaded from the SuperMAG website (Gjerloev, 2012). Raw measurements collected from magnetometer stations around the globe are preprocessed by SuperMAG with a common baseline removal approach that subtracts the daily variations and yearly trend, and are transformed to the same coordinate system and an identical resolution (Gjerloev, 2012). Ground magnetic perturbations are highly variable, making model training at full temporal resolution challenging. To address this, a common preprocessing step in the literature, as seen, for example, in Pulkkinen et al. (2013) and Iong et al. (2024), is to summarize d or its time derivative by the maximum over 20-min temporal bins. In this work, we preprocess the perturbation data by summarizing each component by the value with the largest magnitude within 10-min temporal bins. The model is then trained on these summarized values. During testing, the model predictions serve as proxy estimates for the 1-min temporal resolution ground magnetic perturbations, and are evaluated with respect to the true 1-min values. In other words, the model takes input at a given time and predicts the maximum perturbation (by magnitude) over a 10-min window, which is then used as an estimate for the 1-min perturbation at the model lead time. The use of summary values provides effective temporal smoothing while preserving perturbation peaks, thereby reducing variability in model parameters across different training data sets. We experimented with different sizes of temporal bins and found empirically that the 10-min bin yields the highest model likelihood on the validation set, which will be defined in the next section.
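The binning step can be illustrated with a short sketch. The function name `summarize_bins` is hypothetical, and treating a trailing partial bin as its own bin is an assumption.

```python
def summarize_bins(values, bin_size=10):
    """Summarize a 1-min series by the value with the largest magnitude
    in each consecutive bin of `bin_size` samples (sign is preserved)."""
    return [
        max(values[i:i + bin_size], key=abs)
        for i in range(0, len(values), bin_size)
    ]


# Twenty 1-min samples of a signed component (nT), giving two 10-min bins.
series = [3, -12, 5, 0, 1, 2, -4, 6, 7, -1,
          10, 9, -8, 2, 0, 1, 3, -2, 4, 5]
summary = summarize_bins(series)
```

Selecting by magnitude rather than by value matters for the signed north and east components, where a large negative excursion would otherwise be discarded by a plain maximum.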
2.5 Data for Model Training, Validation, and Testing
In this study, we use SuperMAG data from all available stations (approximately 350) spanning 28 years (from 1995 to 2022) to develop the GeoDGP model. The data is further subject to the availability of OMNI data set, which provides the model input. We will carry out our data-driven modeling exclusively on data from storm periods. Specifically, a storm period is identified by first finding the storm peak, defined as the time when Dst reaches its local minimum below nT. A 120-hr interval is then formed by extending 60 hr around this peak. This approach balances the amount of data from quiet periods with that from storm periods.
- The test set consists of all 22 storms in 2015 (also studied by Al Shidi et al. (2022)), the 2011-08-05 storm (studied by Upendran et al. (2022)), and the 2024-05-10 Gannon extreme storm.
- The validation set consists of randomly sampled 10% of storms with minimum Dst less than nT.
- The training set consists of all remaining storms.
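The storm-period selection described in this section can be sketched as below. The function name `storm_periods` is hypothetical, the Dst threshold is left as a required argument rather than assumed, and interpreting "local minimum" as the minimum within 60 hr on either side is an assumption.

```python
def storm_periods(dst, threshold, half_width=60):
    """Return (start, end) index pairs of windows centred on storm peaks.

    dst: hourly Dst values in nT.
    A storm peak is taken to be an hour whose Dst is below `threshold`
    and is the minimum within +/- `half_width` hours (assumed definition).
    Windows are clipped at the ends of the series.
    """
    periods = []
    n = len(dst)
    for i, v in enumerate(dst):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        if v < threshold and v == min(dst[lo:hi]):
            periods.append((lo, hi - 1))
    return periods


# Synthetic hourly Dst with a single dip at hour 100.
dst = [0.0] * 200
dst[100] = -80.0
# The threshold value here is only for the synthetic example.
periods = storm_periods(dst, threshold=-50.0)
```

Note that a plateau of equal minimum values would produce several overlapping windows under this definition, which a production implementation would need to merge.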
3 Methodology
The GeoDGP model is developed based on the DGP. We begin by introducing the building blocks of DGP: GP. Then, we generalize GP to DGP and describe the model architecture adopted in this work. Finally, we present and discuss the choice of metrics for model evaluation.
3.1 Gaussian Process Regression
Here is the vector of covariances between and the training latent function values, is a covariance matrix with , is the identity matrix, is the vector of training outputs and is the vector of mean function values at training points.
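For reference, the quantities defined above are those that appear in the standard GP regression predictive equations. A sketch under assumed notation (test input $\mathbf{x}_*$, latent value $f_*$, kernel $k$, mean function $m$, and Gaussian noise variance $\sigma^2$; these symbols are assumptions and may differ from the notation used elsewhere in this paper):

```latex
\begin{aligned}
\mathbb{E}[f_* \mid \mathbf{y}] &= m(\mathbf{x}_*) + \mathbf{k}_*^\top \left(K + \sigma^2 I\right)^{-1} (\mathbf{y} - \mathbf{m}), \\
\operatorname{Var}[f_* \mid \mathbf{y}] &= k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top \left(K + \sigma^2 I\right)^{-1} \mathbf{k}_*,
\qquad K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j).
\end{aligned}
```

The matrix inverse in both expressions is the source of the cubic computational cost discussed next.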
Applying GP regression to predict ground magnetic field perturbations faces several challenges. First, its computational cost scales as O(N^3) with the number of training points N due to the need to invert the covariance matrix, making it impractical for the SuperMAG data set, which contains hundreds of millions of data entries. Second, the expressive power of a GP model is heavily dependent on the choice of kernel functions. Standard stationary kernels are often inadequate since stationarity and/or smoothness assumptions are violated by storm time disturbances. Iong et al. (2024) mitigated this effect by introducing contaminated Gaussian noise (Gleason, 1993) to the GP with standard stationary kernels. While this technique improved the coverage of the prediction interval, it still underestimated the mean. Alternatively, specifying a non-stationary kernel offers more flexibility but has its own challenges. The complexity of data often requires a richly parameterized kernel, which may be difficult to design, prone to overfitting, and expensive to optimize (Calandra et al., 2016; Duvenaud et al., 2013; Noack et al., 2023; Snoek et al., 2012; Wilson et al., 2016).
3.2 Deep Gaussian Process
Training is performed by maximizing the marginal likelihood of the model, and the computational challenge is addressed using approximate inference techniques, such as variational inference (VI) (Fox & Roberts, 2012; Matthews et al., 2016; Titsias, 2009), stochastic VI (SVI) (Hoffman et al., 2013) and minibatch subsampling (Hensman et al., 2013). The main idea is to obtain a low-rank approximation to the covariance matrix of each GP layer using a set of inducing points, whose locations can be optimized based on training data, and whose number is much smaller than the number of training data points. This substantially reduces the computational cost, making DGP scalable to data sets with millions of observations. In this work, we adopt the doubly stochastic variational approach proposed by Salimbeni and Deisenroth (2017), which improved upon the original DGP (Damianou & Lawrence, 2013) by maintaining the correlation between layers while placing no restrictions on the noise corruptions in each layer. For prediction, the output is modeled as a mixture of Gaussians, with each component drawn from the variational posterior of latent functions evaluated at the test location. We set the number of mixture components to 10.
DGP circumvents the challenge of handcrafting a sophisticated nonstationary kernel (Noack et al., 2023), as one can stack GP layers with stationary kernels to introduce a hierarchical composition of nonlinear transformations to the input space, thereby achieving nonstationarity. We show our choice of model architecture and setup next.
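The hierarchical composition mentioned above can be written compactly. The following is a simplified sketch under assumed notation (layer index $\ell$, number of GP layers $L$, layer mean functions $m^{(\ell)}$ and stationary kernels $k^{(\ell)}$, observation noise $\varepsilon$), with noise in the inner layers omitted for brevity:

```latex
\begin{aligned}
\mathbf{h}^{(0)} &= \mathbf{x}, \\
\mathbf{h}^{(\ell)} &= f^{(\ell)}\!\left(\mathbf{h}^{(\ell-1)}\right), \qquad
f^{(\ell)} \sim \mathcal{GP}\!\left(m^{(\ell)}, k^{(\ell)}\right), \quad \ell = 1, \dots, L, \\
y &= \mathbf{h}^{(L)} + \varepsilon .
\end{aligned}
```

Although each $k^{(\ell)}$ is stationary in its own input, the composed map from $\mathbf{x}$ to $y$ is generally nonstationary because each layer warps the input space seen by the next.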
3.3 Model Architecture and Setup
The DGP model architecture used in this work is chosen based on empirical experiments. Specifically, the model consists of one input layer, three hidden layers, and one output layer (i.e., ). The dimension of the input vector (introduced in Section 2.3) is 96, while the dimensions of each GP layer are set to be , , , and , respectively. The numbers of inducing points we use for each GP layer are 256, 128, 128, 128, respectively. Each GP layer is equipped with a stationary kernel and a mean function.
We use a linear mean function for all hidden layers. Duvenaud et al. (2014) show that a DGP with zero mean functions for the inner layers suffers a degeneration in representational capacity as the number of layers increases. Using a linear mean function avoids this pathology (Salimbeni & Deisenroth, 2017). These mean function parameters are also estimated from the training set.
3.4 Model Evaluation Metrics
Predicted \ Observed | 1 (above threshold) | 0 (below threshold) |
---|---|---|
1 (above threshold) | H (Hit: true positive) | F (False alarm: false positive) |
0 (below threshold) | M (Miss: false negative) | N (No event: true negative) |
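Given the four counts in the contingency table, the Heidke Skill Score can be computed as in the sketch below, which uses the standard HSS formula. The function name `heidke_skill_score`, the strict "greater than" event definition, and returning 0 when the denominator vanishes are assumptions for illustration.

```python
def heidke_skill_score(obs, pred, threshold):
    """Heidke Skill Score for threshold exceedance.

    An event is a value strictly above `threshold` (assumed convention).
    HSS = 2 (H N - M F) / ((H + M)(M + N) + (H + F)(F + N)).
    """
    H = F = M = N = 0
    for o, p in zip(obs, pred):
        o_event = o > threshold
        p_event = p > threshold
        if p_event and o_event:
            H += 1          # hit
        elif p_event and not o_event:
            F += 1          # false alarm
        elif not p_event and o_event:
            M += 1          # miss
        else:
            N += 1          # correct no-event
    denom = (H + M) * (M + N) + (H + F) * (F + N)
    # Degenerate case (e.g., no events observed or predicted): report no skill.
    return 2 * (H * N - M * F) / denom if denom else 0.0


perfect = heidke_skill_score([60, 40], [70, 30], threshold=50)
no_skill = heidke_skill_score([60, 40, 70, 10], [55, 45, 30, 60], threshold=50)
```

A score of 1 indicates perfect categorical forecasts, 0 indicates no skill relative to chance, and negative values indicate forecasts worse than chance.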
We also calculate the sample-averaged interval width and the empirical coverage rate, that is, the fraction of observations that fall within the prediction interval, whose expected value is the nominal coverage level of the interval.
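These two uncertainty metrics can be computed directly from the interval bounds. The sketch below assumes the prediction interval for each sample is given by a lower and an upper bound; the function name `interval_metrics` is hypothetical.

```python
def interval_metrics(obs, lower, upper):
    """Return (average interval width, empirical coverage rate).

    obs, lower, upper: equal-length sequences of observations and the
    lower/upper bounds of the corresponding prediction intervals.
    """
    n = len(obs)
    width = sum(u - l for l, u in zip(lower, upper)) / n
    covered = sum(1 for y, l, u in zip(obs, lower, upper) if l <= y <= u)
    return width, covered / n


width, coverage = interval_metrics(
    obs=[1.0, 5.0, 10.0],
    lower=[0.0, 0.0, 0.0],
    upper=[2.0, 4.0, 12.0],
)
```

For a well-calibrated model, the empirical coverage should be close to the nominal level of the interval, and among calibrated models a narrower average width is preferable.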
4 Results
4.1 Model Comparisons
- Test set 1 consists of 22 geomagnetic storms in 2015.
- Test set 2 consists of the 2011-08-05 and 2015-03-15 storms.
- Test set 3 consists of the 2024-05-10 Gannon extreme storm.
We use test set 1 to statistically evaluate the performance of GeoDGP and compare it against the Geospace model. Test set 2, although smaller, is specifically used for comparing GeoDGP with the DAGGER model, as evaluation of DAGGER is only available on these two storms (Upendran et al., 2022). Finally, test set 3 compares GeoDGP and Geospace during an extreme geomagnetic storm, providing insights into their robustness under severe conditions.
The Geospace model consists of three components of the SWMF (Gombosi et al., 2021; Tóth et al., 2005): the Global Magnetosphere domain represented by the Block-Adaptive Tree Solar wind Roe-type Upwind Scheme (BATSRUS) MHD model (Powell et al., 1993; Tóth et al., 2012), the Inner Magnetosphere simulated by the Rice Convection Model (RCM, Toffoletto et al., 2003), and the Ionosphere Electrodynamics solved by the Ridley Ionosphere Model (RIM, Ridley et al., 2004). The Geospace model is driven by the F10.7 index and by the solar wind and IMF observations ballistically propagated from L1 to the upstream boundary at 32 Earth radii from the center of Earth. The Geospace model generates, among other products, magnetic perturbation forecast maps with a lead time equal to the L1-to-Earth propagation time. Its performance during storm time was studied in Pulkkinen et al. (2013), Tóth et al. (2014), and Al Shidi et al. (2022).
The DAGGER model, in contrast, is a data-driven model that combines deep learning with a spherical harmonic basis to predict ground magnetic perturbations min ahead. Specifically, it consists of a Gated Recurrent Unit (Cho et al., 2014) module that summarizes solar wind features, an ANN module that maps these features to coefficients of the spherical harmonic basis, and a spherical harmonic constructor. Upendran et al. (2022) provide a detailed description of the DAGGER model.
GeoDGP has two different forecast horizons: one is the L1-to-Earth propagation time, which is the same as that of the Geospace model and thus allows for a fair comparison, while the other is the propagation time plus 1 hr, for which there are currently no publicly available forecasting models. Here, we additionally introduce model persistence (MP) and observation persistence (OP) as baselines, referred to by these abbreviations in the tables. A persistence model with a given lead time assumes that the value at the forecast time equals the value one lead time earlier. We compare GeoDGP at the longer lead time to its 1-hr MP, which is obtained by using the GeoDGP predictions for the shorter lead time to predict the observations 1 hr later. This comparison assesses whether the model with the larger lead time provides more informative predictions for the future than a simple shift of earlier predictions. Observation persistence, on the other hand, is a common benchmark for time series prediction; however, it is often difficult to beat on short time scales. We note that real-time magnetometer observations are only available at a few locations, while the models provide forecasts on the entire surface of Earth. For OP, since the propagation lead time depends on the solar wind speed and typically ranges from 30 min to 1 hr, we use an average of 45 min as an approximation. Subsequently, we compare GeoDGP predictions at the two lead times to the OP of 45 min and 1 hr 45 min, respectively. The comparison assesses whether the model provides more informative predictions for the future than the current observation.
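Both persistence baselines amount to shifting a time series forward by the lead time. A minimal sketch, assuming a regularly sampled 1-min series and a lead time expressed in samples (the function name `persistence_forecast` is hypothetical):

```python
def persistence_forecast(series, lead):
    """Persistence baseline: the forecast for index t + lead is the value
    at index t. The first `lead` entries have no forecast (None)."""
    n = len(series)
    lead = min(lead, n)
    return [None] * lead + list(series[:n - lead])


# Observation persistence with a 45-sample (45-min) lead on a toy series.
obs = list(range(100))
op_45 = persistence_forecast(obs, lead=45)
```

Applied to observations, this gives OP; applied to the shorter-lead-time model predictions with a 60-sample shift, it gives the 1-hr MP described above.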
- OMNI data (i.e., the model input) is missing and the gap is larger than 15 min, as described in Section 2.3;
- Station measurements are unavailable; and
- Predictions from the model being compared (i.e., Geospace and/or DAGGER) are unavailable.
Evaluations involving multiple storms are conducted by first concatenating the storm data and then calculating the evaluation metrics. The performance of the models is compared on a regional basis, as the availability of ground magnetometer stations varies over time. We divide magnetometer stations into low- and mid-latitude, high-latitude, and auroral-latitude regions based on their magnetic latitude . A station is assigned to the low- and mid-latitude region if , and to the high-latitude region otherwise. The auroral-latitude region is defined as . For low- and mid-latitude stations, we calculate the HSS of d predictions using thresholds of 50, 100, and 200 nT. For high-latitude stations, we use thresholds of 50, 200, 300, and 400 nT due to higher disturbance levels. Table 4 shows the regional medians of percentiles in magnetometer station observations corresponding to the selected HSS thresholds during storm periods. The calculation includes 22 storms in 2015 (i.e., test set 1). We see that the largest threshold of 400 nT corresponds to at least the 95th percentile of observations, thus evaluating whether the model can predict the “tail” of the distribution during storm periods. In contrast, the smallest threshold of 50 nT evaluates the model performance around the median of the observations in the high-latitude regions, and around the 70th percentile when all latitudes are considered.
2011-08-05 18:02 | 2015-02-16 19:24 | 2015-03-17 04:07 | 2015-04-09 21:52 |
2015-04-14 12:55 | 2015-05-12 18:05 | 2015-05-18 10:12 | 2015-06-07 10:30 |
2015-06-22 05:00 | 2015-07-04 13:06 | 2015-07-10 22:21 | 2015-07-23 01:51 |
2015-08-15 08:04 | 2015-08-26 05:45 | 2015-09-07 13:13 | 2015-09-08 21:45 |
2015-09-20 05:46 | 2015-10-04 00:30 | 2015-10-07 01:41 | 2015-11-03 05:31 |
2015-11-06 18:09 | 2015-11-30 06:09 | 2015-12-19 16:13 | 2024-05-10 18:00 |
HSS thresholds (nT) | 50 | 100 | 200 | 300 | 400 |
---|---|---|---|---|---|
Percentiles of low and mid latitudes | 76th | 95th | 100th | 100th | 100th |
Percentiles of high latitudes | 49th | 68th | 91st | 97th | 98th |
Percentiles of auroral latitudes | 47th | 65th | 82nd | 90th | 95th |
Percentiles of all latitudes | 71st | 91st | 98th | 99th | 100th |
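The threshold-based evaluation described above can be sketched in a few lines. This is a minimal illustration, assuming the HSS is computed in its standard 2x2 contingency-table form for the event "value at or above the threshold"; the function names and the toy storm arrays are hypothetical and do not come from the paper.

```python
import numpy as np

def heidke_skill_score(obs, pred, threshold):
    """HSS for the binary event 'value >= threshold' (standard 2x2 form)."""
    obs_event = np.asarray(obs) >= threshold
    pred_event = np.asarray(pred) >= threshold
    hits = int(np.sum(obs_event & pred_event))
    false_alarms = int(np.sum(~obs_event & pred_event))
    misses = int(np.sum(obs_event & ~pred_event))
    correct_neg = int(np.sum(~obs_event & ~pred_event))
    num = 2.0 * (hits * correct_neg - false_alarms * misses)
    den = ((hits + misses) * (misses + correct_neg)
           + (hits + false_alarms) * (false_alarms + correct_neg))
    # With no events observed and none predicted the score is undefined;
    # returning 0.0 here is a convention chosen for this sketch.
    return num / den if den > 0 else 0.0

def threshold_percentile(obs, threshold):
    """Percentage of observations strictly below the threshold."""
    return 100.0 * float(np.mean(np.asarray(obs) < threshold))

# Toy data for two storms at one station (nT); values are made up.
storm_obs = [np.array([10.0, 60.0, 120.0, 30.0]), np.array([250.0, 40.0, 80.0, 20.0])]
storm_pred = [np.array([20.0, 70.0, 90.0, 25.0]), np.array([210.0, 55.0, 45.0, 10.0])]

# Concatenate the storms first, then compute the metric once.
obs = np.concatenate(storm_obs)
pred = np.concatenate(storm_pred)
hss_50 = heidke_skill_score(obs, pred, 50.0)
pct_50 = threshold_percentile(obs, 50.0)
```

In this toy case the 50 nT threshold sits at the median of the observations, which mirrors the role the 50 nT threshold plays for high-latitude stations in Table 4.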
4.2 Test Set 1: Statistical Comparison Between GeoDGP and Geospace Models
Test set 1 includes a total of 22 geomagnetic storms in 2015, which covers a wide range of events. Specifically, the median minimum Dst is nT, while the lowest minimum Dst is nT. On this test set, we evaluate and compare our GeoDGP predictions with the Geospace simulations performed by Al Shidi et al. (2022). The total number of stations being studied is 222. The regional categorization assigns 81 stations to the low- and mid-latitude region, 141 stations to the high-latitude region, and 53 stations to the auroral latitude region. We report the evaluation metrics of the two models in Table 5. The metrics of stations within each region are summarized by their median values. The model with the best performance is bolded.
B | Region | Metric | Geospace | GeoDGP (NoDst, OP) | GeoDGP (MP, OP) |
---|---|---|---|---|---|
(, 45 min) | + 1 hr ( + 1 hr, 105 min) | ||||
d | Low and mid latitudes; ; 81 Stations | HSS (50 nT) | 0.57 | 0.74 (0.51, 0.75) | 0.71 (0.69, 0.61) |
HSS (100 nT) | 0.55 | 0.75 (0.36, 0.75) | 0.72 (0.69, 0.65) | ||
HSS (200 nT) | 0.00 | 0.32 (0.00, 0.39) | 0.06 (0.14, 0.30) | ||
MAE | 15 | 10 (15, 9) | 11 (12, 13) | ||
d | MAE | 16 | 11 (-, 9) | 12 (13, 14) | |
SA | 0.84 | 0.90 (-,0.92) | 0.90 (0.89, 0.88) | ||
d | MAE | 11 | 10 (-, 6) | 10 (10, 9) | |
SA | 0.58 | 0.63 (-, 0.81) | 0.63 (0.63, 0.73) | ||
d | High latitudes; ; 141 Stations | HSS (50 nT) | 0.44 | 0.57 (0.53, 0.59) | 0.49 (0.52, 0.44) |
HSS (200 nT) | 0.31 | 0.56 (0.50, 0.48) | 0.49 (0.48, 0.31) | ||
HSS (300 nT) | 0.26 | 0.48 (0.46, 0.39) | 0.40 (0.38, 0.22) | ||
HSS (400 nT) | 0.20 | 0.42 (0.38, 0.32) | 0.32 (0.30, 0.14) | ||
MAE | 74 | 46 (46, 46) | 52 (53, 57) | ||
d | MAE | 86 | 53 (-, 52) | 55 (59, 68) | |
SA | 0.69 | 0.78 (-, 0.79) | 0.76 (0.76, 0.72) | ||
d | MAE | 44 | 35 (-, 35) | 37 (38, 45) | |
SA | 0.59 | 0.70 (-, 0.75) | 0.68 (0.68, 0.69) | ||
d | Auroral latitudes; ; 53 Stations | HSS (50 nT) | 0.49 | 0.57 (0.55, 0.61) | 0.49 (0.53, 0.45) |
HSS (200 nT) | 0.42 | 0.59 (0.56, 0.50) | 0.54 (0.53, 0.32) | ||
HSS (300 nT) | 0.34 | 0.55 (0.53, 0.44) | 0.50 (0.46, 0.26) | ||
HSS (400 nT) | 0.27 | 0.51 (0.48, 0.38) | 0.47 (0.41, 0.20) | ||
MAE | 80 | 68 (70, 69) | 75 (79, 93) | ||
d | MAE | 95 | 70 (-, 74) | 74 (81, 102) | |
SA | 0.69 | 0.78 (-, 0.79) | 0.76 (0.76, 0.71) | ||
d | MAE | 45 | 37 (-, 40) | 38 (39, 49) | |
SA | 0.56 | 0.66 (-, 0.72) | 0.63 (0.63, 0.65) | ||
d | All latitudes; ; 222 Stations | HSS (50 nT) | 0.49 | 0.63 (0.53, 0.64) | 0.57 (0.57, 0.49) |
HSS (200 nT) | 0.24 | 0.53 (0.43, 0.47) | 0.43 (0.42, 0.31) | ||
HSS (300 nT) | 0.22 | 0.44 (0.41, 0.36) | 0.34 (0.33, 0.19) | ||
HSS (400 nT) | 0.19 | 0.41 (0.36, 0.32) | 0.29 (0.29, 0.14) | ||
MAE | 27 | 23 (26, 18) | 26 (26, 24) | ||
d | MAE | 30 | 21 (-, 19) | 23 (24, 26) | |
SA | 0.71 | 0.82 (-, 0.83) | 0.80 (0.80, 0.78) | ||
d | MAE | 23 | 18 (-, 17) | 19 (19, 21) | |
SA | 0.58 | 0.67 (-, 0.77) | 0.66 (0.67, 0.70) |
- Note. Median HSS, MAE [nT] and sign accuracy (SA) in various magnetic latitude regions for Geospace, GeoDGP, observation persistence (OP), GeoDGP without Dst (only for predicting d) (NoDst), and GeoDGP model persistence (MP). The corresponding lead time for each model is shown below the model name. The model with the best performance is bolded.
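As a rough illustration of the metrics and the persistence baseline named in the table note, the sketch below assumes that sign accuracy is the fraction of samples whose predicted sign matches the observed sign, and that observation persistence reuses the observation from one lead time earlier. Both definitions are assumptions made for this example, as are the function names and the toy series.

```python
import numpy as np

def mean_absolute_error(obs, pred):
    """Mean absolute difference between prediction and observation."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(obs))))

def sign_accuracy(obs, pred):
    """Fraction of samples whose predicted sign matches the observed sign."""
    return float(np.mean(np.sign(obs) == np.sign(pred)))

def observation_persistence(obs, lead_steps):
    """Forecast for step t is the observation at step t - lead_steps."""
    if lead_steps <= 0:
        raise ValueError("lead_steps must be a positive number of samples")
    obs = np.asarray(obs, dtype=float)
    forecast = np.full_like(obs, np.nan)  # no forecast for the first steps
    forecast[lead_steps:] = obs[:-lead_steps]
    return forecast

# Toy signed component series (nT); at 1-min cadence a 45 min lead
# would correspond to lead_steps=45, here 2 steps keep the example short.
obs = np.array([5.0, -3.0, -8.0, 2.0, 6.0, -1.0])
op = observation_persistence(obs, lead_steps=2)
valid = ~np.isnan(op)

op_mae = mean_absolute_error(obs[valid], op[valid])
op_sa = sign_accuracy(obs[valid], op[valid])
```

The toy series changes sign every two steps, so persistence gets every sign wrong here; this is the kind of rapidly varying behavior that makes a persistence baseline weaker.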
We begin by discussing the performance of GeoDGP with a lead time of . For the all-latitude global prediction of d, GeoDGP achieves a median HSS of 0.63 with the 50 nT threshold, outperforming the Geospace model's HSS of 0.49. For thresholds of 200 nT and above, GeoDGP achieves median HSS values that roughly double those from the Geospace model: 0.53 versus 0.24 at the 200 nT threshold, 0.44 versus 0.22 at the 300 nT threshold, and 0.41 versus 0.19 at the 400 nT threshold. Additionally, GeoDGP outperforms OP in HSS with thresholds of 200 nT and above, while achieving comparable HSS with the 50 nT threshold. In contrast, the Geospace model's HSS falls below that of OP at every threshold. These results illustrate the strong performance of GeoDGP in predicting high-level disturbances, and suggest that the Geospace model has a tendency to underpredict, as also noted in Al Shidi et al. (2022). GeoDGP also has a smaller MAE of 23 nT compared to 27 nT for Geospace, although it is larger than the 18 nT of OP, which is expected given the short time scale.
For regional comparisons, GeoDGP consistently outperforms the Geospace model and shows higher HSS and lower MAE across all regions. As the threshold increases from 50 to 200 nT, the HSS for both models decreases more rapidly in the low- and mid-latitude region than in the high-latitude region. This is due to the relative rarity of high-level disturbances in the low- and mid-latitude region, where a threshold of 200 nT corresponds to the upper extreme of the distribution (i.e., the 100th percentile as shown in Table 4), making such events inherently more difficult to predict. Figure 2 illustrates this by showing GeoDGP's HSS at thresholds of 50 and 200 nT for each station. The HSS of high-latitude stations remains high under both thresholds, whereas the HSS of low- and mid-latitude stations exhibits noticeable drops as the threshold increases. This shows that for general cases (non-rare events), GeoDGP does not underpredict high-level disturbances. Figure 3 illustrates how HSS varies with magnetic latitude under four different thresholds. Across all latitudes, GeoDGP consistently achieves higher HSS compared to Geospace. For both models, increasing the threshold leads to narrower station distributions concentrating around higher magnetic latitudes, that is, closer to the auroral zone, as expected. Table 5 shows that the increasing level of disturbances with magnetic latitude also results in a corresponding increase in MAE, and GeoDGP achieves lower MAE values compared to Geospace. Moreover, GeoDGP achieves comparable or higher HSS than OP across all regions. Given that most ground magnetometer station data is not available in real time and that global station coverage is uneven, the stable performance of GeoDGP underscores its potential for real-time forecasting of local ground magnetic field perturbations worldwide.

Heidke Skill Score (HSS) of GeoDGP with prediction lead time for individual magnetometer stations. The top and bottom panels show the HSS with thresholds of 50 nT (top) and 200 nT (bottom), respectively. The evaluation included 22 geomagnetic storms in 2015. The black solid curves represent the magnetic latitude gridlines that separate the low- and mid-latitude region from the high-latitude region. The shaded areas represent the - auroral-latitude region.

Heidke Skill Score (HSS) of GeoDGP (red) and Geospace (blue) for individual magnetometer stations by magnetic latitude. The evaluation included 22 geomagnetic storms in 2015. Both models have lead time . The four panels plot HSS under thresholds of 50, 200, 300, and 400 nT. The shaded areas represent the auroral-latitude region between and . The dashed line indicates the magnetic latitude, which separates the high-latitude region from the low- and mid-latitude region.
Predicting the north and east components of the perturbation is more challenging because the model must also determine the correct sign of these components. The regional comparisons show that both models have higher MAE values in predicting the north component than in predicting the horizontal component d, despite the fact that the magnitude of the north component is never larger than that of d per Equation 1. Table 5 shows that GeoDGP achieves an SA of 0.82 for the north component in the global all-latitude case, compared to 0.71 for Geospace. With a higher SA, GeoDGP's performance on the north component is closer to its performance on d than is the case for Geospace. Specifically, across all latitudes, GeoDGP achieves an MAE of 21 nT (compared to 30 nT for Geospace) for the north component and 23 nT (compared to 27 nT for Geospace) for d. While neither model does better than OP in predicting the north component, GeoDGP comes much closer. For the east component, the MAE values of both models are much smaller because of its lower average magnitude, and GeoDGP still outperforms Geospace. However, the SA of the east component is low for both models, with GeoDGP scoring below 0.7 and Geospace below 0.6, both significantly lower than the 0.77 from OP. Accurately predicting the east component thus remains a challenge. This is consistent with the general understanding of the configurations of the current systems leading to these ground magnetic perturbations. For example, the classical substorm current wedge configuration (McPherron et al., 1973) predicts dominant northward perturbations at a mid-latitude station on the night side but eastward (westward) perturbations on the west (east) part of the current wedge.
Extending the forecast lead time of GeoDGP from to results in decreased model performance, as expected. However, as shown in Table 5, GeoDGP with still outperforms Geospace with across all evaluation metrics. It also outperforms the OP with significantly higher HSS in predicting d, lower MAE and higher SA in predicting d, and comparable metrics in predicting d. Moreover, GeoDGP with demonstrates better or similar performance compared to the 1 hr GeoDGP MP. While we explored extending GeoDGP's forecast lead time to and beyond, the performance declined significantly. Therefore, we conclude that the current GeoDGP configuration provides reliable predictions up to hr into the future, which corresponds to a forecast range of 1.5–2 hr in practice, depending on the solar wind speed.
Additionally, Table 6 shows the evaluation of the probabilistic aspect of GeoDGP using the regional median of coverage rate, average interval width, and average interval score of the prediction intervals. Figure 4 further illustrates how these measures vary with magnetic latitude for time ahead d predictions. Although the empirical coverage of 98% is close to the nominal rate of 95%, we observe an uneven distribution of coverage rate across magnetic latitude, likely due to regional differences in disturbance levels. The coverage rate in the low- and mid-latitude region is near 100%, indicating overly wide prediction intervals, whereas in the auroral-latitude region, the coverage rate falls well below the nominal rate. As a result, we also find a significantly higher interval score in the auroral-latitude region compared to the rest of the high-latitude region. The interval score differs from the interval width by including a penalty term for observations that fall outside the prediction interval. The elevated score indicates that predictions in the auroral-latitude region remain challenging. On the other hand, the presence of spatial heterogeneity in predictive uncertainty is clearly demonstrated by the increasing average interval width from lower to higher magnetic latitudes. Table 6 also shows that the interval scores of all-latitude predictions for ahead are consistently lower than those of the predictions for + 1 hr ahead, as expected.
Region | Metric | d, d, dTS | d, d, dTS + 1 hr |
---|---|---|---|
Low and mid latitudes | Coverage Rate | 99%, 99%, 99% | 99%, 99%, 99% |
Interval Width | 131, 217, 170 | 142, 237, 174 | |
Interval Score | 132, 217, 170 | 143, 237, 175 | |
High latitudes | Coverage Rate | 91%, 88%, 93% | 91%, 90%, 92% |
Interval Width | 171, 225, 173 | 188, 246, 176 | |
Interval Score | 409, 589, 346 | 450, 576, 378 | |
Auroral latitudes | Coverage Rate | 81%, 83%, 91% | 82%, 84%, 90% |
Interval Width | 175, 229, 174 | 192, 248, 177 | |
Interval Score | 837, 832, 424 | 832, 832, 449 | |
All latitudes ; | Coverage Rate | 98%, 98%, 98% | 98%, 98%, 98% |
Interval Width | 141, 218, 170 | 154, 239, 175 | |
Interval Score | 185, 269, 206 | 216, 298, 216 |
- Note. The nominal coverage rate is 95%.
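The three measures reported in Table 6 can be illustrated as follows. This sketch assumes the interval score takes the standard form for a central (1 − α) prediction interval, that is, the interval width plus a penalty of 2/α times the distance by which an observation falls outside the interval, with α = 0.05 for the 95% nominal rate. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def interval_metrics(obs, lower, upper, alpha=0.05):
    """Sample-averaged coverage rate, interval width, and interval score."""
    obs, lower, upper = (np.asarray(a, dtype=float) for a in (obs, lower, upper))
    width = upper - lower
    below = np.clip(lower - obs, 0.0, None)  # distance below the lower bound
    above = np.clip(obs - upper, 0.0, None)  # distance above the upper bound
    score = width + (2.0 / alpha) * (below + above)
    covered = (obs >= lower) & (obs <= upper)
    return {
        "coverage_rate": float(np.mean(covered)),
        "interval_width": float(np.mean(width)),
        "interval_score": float(np.mean(score)),
    }

# Toy example (nT): the third observation lies 50 nT above its interval.
obs = [100.0, 250.0, 400.0, 90.0]
lower = [50.0, 100.0, 150.0, 60.0]
upper = [150.0, 300.0, 350.0, 160.0]
metrics = interval_metrics(obs, lower, upper)
```

A single miss of 50 nT raises the average score from 150 to 650 in this example, which shows how a modest drop in coverage can produce the large gap between interval width and interval score seen in the auroral-latitude rows.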

Sample averaged coverage rate, interval width and interval score of GeoDGP d probabilistic prediction versus magnetic latitude. The prediction has a lead time of , and the evaluation included 22 geomagnetic storms in 2015. Each point represents a station. The average coverage rate is 98% (black horizontal line in Panel 1), while the nominal rate is 95% (red horizontal line in Panel 1). The shaded areas represent the auroral-latitude region between and . The dashed line indicates the magnetic latitude, which separates the high-latitude region from the low- and mid-latitude region.
As a final note in this section, we highlight that the only difference in model input between GeoDGP and Geospace is the inclusion of the Dst index, which characterizes the overall disturbance level of the magnetosphere. To demonstrate the benefit of including the Dst index for predicting d, we train a model using the same architecture as GeoDGP but without the Dst index as an input. The results, shown in Table 5, indicate that the performance of the modified model is consistently worse than that of GeoDGP. However, it still outperforms the Geospace model. This finding demonstrates the value of including the Dst index as an input and illustrates the advantage of the data-driven GeoDGP approach over the physics-based first-principles Geospace model. It also aligns with prior studies (e.g., Smith et al., 2020), which demonstrate that including state-characterizing variables (e.g., the Dst index) of magnetospheric systems can improve predictive performance.
4.3 Test Set 2: Comparison Between GeoDGP and DAGGER
Test set 2 consists of two geomagnetic storms: the 2011-08-05 storm with a minimum Dst of nT and the 2015-03-17 storm with a minimum Dst of nT. This test set is used to compare GeoDGP with the DAGGER model, which operates with a forecast horizon of . We show that, compared to DAGGER, GeoDGP not only extends the forecast horizon by 30 min but also achieves better performance. The evaluation covers a total of 178 stations, all located in the Northern Hemisphere with magnetic latitudes greater than 40°, since the DAGGER model has only been evaluated on this subset.
Table 7 presents the evaluation metrics for both models. GeoDGP with a lead time shows a significant advantage over DAGGER in the all-latitude global prediction of d. Specifically, GeoDGP achieves a median HSS of 0.36 with the 50 nT threshold (compared to DAGGER's 0.22), 0.49 with the 200 nT threshold (compared to DAGGER's 0.05), 0.38 with the 300 nT threshold (compared to DAGGER's 0.00) and 0.34 with the 400 nT threshold (compared to DAGGER's 0.00). These results indicate that GeoDGP with remains effective in predicting high-level disturbances, whereas DAGGER performs no better than a random forecast at high thresholds (i.e., zero or near-zero HSS values). GeoDGP also achieves lower MAE values of 64 and 69 nT (compared to DAGGER's 82 and 76 nT) in predicting d and d, respectively, while achieving a comparable MAE of 45 nT (compared to DAGGER's 44 nT) in predicting d. Similar trends are observed in regional comparisons, where GeoDGP consistently outperforms DAGGER. Comparisons with OP reaffirm the findings from Section 4.2, showing that GeoDGP with outperforms OP. Additionally, GeoDGP either matches or exceeds the performance of the 1 hr GeoDGP MP.
B | Region | Metric | DAGGER | GeoDGP (MP, OP) |
---|---|---|---|---|
+ 30 min | + 1 hr ( + 1 hr, 105 min) | |||
d | Low and mid latitudes; ; 26 Stations | HSS (50 nT) | 0.15 | 0.63 (0.62, 0.55) |
HSS (100 nT) | 0.03 | 0.73 (0.70, 0.64) | ||
HSS (200 nT) | 0.00 | 0.02 (0.01, 0.34) | ||
MAE | 49 | 16 (16, 22) | ||
d | MAE | 45 | 16 (18, 23) | |
SA | 0.83 | 0.96 (0.95, 0.93) | ||
d | MAE | 21 | 20 (18, 19) | |
SA | 0.70 | 0.74 (0.74, 0.79) | ||
d | High latitudes; ; 152 Stations | HSS (50 nT) | 0.25 | 0.31 (0.34, 0.31) |
HSS (200 nT) | 0.07 | 0.52 (0.43, 0.22) | ||
HSS (300 nT) | 0.00 | 0.39 (0.33, 0.09) | ||
HSS (400 nT) | 0.00 | 0.35 (0.20, 0.02) | ||
MAE | 86 | 72 (83, 95) | ||
d | MAE | 82 | 77 (86, 107) | |
SA | 0.75 | 0.79 (0.78, 0.72) | ||
d | MAE | 50 | 50 (52, 63) | |
SA | 0.67 | 0.68 (0.69, 0.68) | ||
d | Auroral latitudes; ; 64 Stations | HSS (50 nT) | 0.32 | 0.31 (0.36, 0.32) |
HSS (200 nT) | 0.22 | 0.51 (0.43, 0.20) | ||
HSS (300 nT) | 0.18 | 0.48 (0.40, 0.13) | ||
HSS (400 nT) | 0.05 | 0.50 (0.41, 0.06) | ||
MAE | 98 | 88 (96, 118) | ||
d | MAE | 98 | 92 (106, 137) | |
SA | 0.75 | 0.78 (0.76, 0.70) | ||
d | MAE | 54 | 55 (58, 70) | |
SA | 0.62 | 0.66 (0.67, 0.65) | ||
d | All latitudes; ; 178 Stations | HSS (50 nT) | 0.22 | 0.36 (0.39, 0.37) |
HSS (200 nT) | 0.05 | 0.49 (0.40, 0.23) | ||
HSS (300 nT) | 0.00 | 0.38 (0.32, 0.07) | ||
HSS (400 nT) | 0.00 | 0.34 (0.19, 0.02) | ||
MAE | 82 | 64 (72, 87) | ||
d | MAE | 76 | 69 (80, 97) | |
SA | 0.76 | 0.82 (0.81, 0.74) | ||
d | MAE | 44 | 45 (47, 53) | |
SA | 0.67 | 0.68 (0.69, 0.69) |
- Note. Median HSS, MAE [nT] and sign accuracy in various magnetic latitude regions of DAGGER, GeoDGP, GeoDGP model persistence (MP) and observation persistence (OP). The corresponding lead time for each model is shown below the model name. The model with the best performance is bolded. All stations are in the Northern Hemisphere with .
To further evaluate the model's ability to capture the temporal behavior of perturbations during a geomagnetic storm, we present station-wise d predictions for the 2011-08-05 storm in Figure 5 using the same six stations analyzed by Pulkkinen et al. (2013) (see Table 8 for station details). Due to the lack of data for station PBQ during this storm, we replace it with station MEA, which has a similar magnetic latitude and is located in a densely populated area. Figure 5 shows that while both GeoDGP and DAGGER capture the major peaks and troughs in the observations, DAGGER has a tendency to underpredict large perturbations, while GeoDGP provides predictions of a similar magnitude to the observations even at a longer forecast horizon. Additionally, compared to DAGGER, the predictive mean of GeoDGP exhibits lower-amplitude oscillations superimposed on the prediction trend, with periods of several hours that generally align well with observations. Lastly, GeoDGP provides probabilistic forecasts of the perturbation values that reflect their uncertainty, whereas DAGGER does not offer this capability.

Test set 2: The 2011-08-05 storm. observations (black) and predictions using GeoDGP (red) and DAGGER (green) at six stations. GeoDGP has a lead time of and DAGGER has a lead time of . The shaded area represents the 95% prediction intervals of GeoDGP.
4.4 Test Set 3: The May 2024 Gannon Extreme Storm
The May 2024 Gannon storm was an extreme G5 geomagnetic storm caused by multiple Earth-directed coronal mass ejections and associated M- and X-class solar flares. It was the strongest geomagnetic storm in the last 20 years, with a minimum Dst of nT. In this section, we showcase the advantages of GeoDGP over a well-tuned Geospace simulation, and demonstrate that GeoDGP produces forecasts better than or comparable to OP, even under these extreme space weather conditions. Due to several data gaps in the OMNI data set during this storm, we use the ballistically propagated 1-min solar wind measurements from the NASA Advanced Composition Explorer (ACE) (Stone et al., 1998) satellite as input to GeoDGP. The Geospace simulation uses WIND (Acuña et al., 1995) satellite observations, which have similar and measurements to those from ACE but more accurate plasma density measurements, as shown in Figure 6. However, since GeoDGP is trained using the OMNI data set, which largely consists of ACE data, the model may already possess the ability to adjust to ACE input. The Geospace model uses a reduced inner boundary radius of 1.9 (instead of the default 2.5 ) and the inner boundary density is set to CPCP (instead of the default CPCP), where the density is measured in units of amu and the cross polar cap potential CPCP in kV. We report evaluation metrics in Table 9 over a total of 206 stations covering both hemispheres of the Earth.

Test set 3: 2024-05-10 Gannon extreme storm. Time series of the SMR index, solar wind speed , plasma density , and , , and . The minimum SMR index value of nT was recorded at 2024-05-10 22:36:00 UTC. GeoDGP uses Advanced Composition Explorer (red) observations as model input and Geospace uses WIND (blue) satellite observations as model input.
B | Region | Metric | Geospace | GeoDGP (OP) | GeoDGP (MP, OP) |
---|---|---|---|---|---|
(45 min) | + 1 hr ( + 1 hr, 105 min) | ||||
d | Low and mid latitudes; ; 83 Stations | HSS (50 nT) | 0.36 | 0.86 (0.80) | 0.78 (0.80, 0.76) |
HSS (100 nT) | 0.62 | 0.73 (0.76) | 0.72 (0.71, 0.70) | ||
HSS (200 nT) | 0.60 | 0.81 (0.77) | 0.80 (0.74, 0.69) | ||
MAE | 55 | 35 (34) | 36 (44, 47) | ||
d | MAE | 55 | 41 (35) | 39 (46, 50) | |
SA | 0.93 | 0.94 (0.96) | 0.93 (0.93, 0.93) | ||
d | MAE | 36 | 34 (22) | 35 (34, 29) | |
SA | 0.53 | 0.60 (0.86) | 0.60 (0.62, 0.79) | ||
d | High latitudes; ; 123 Stations | HSS (50 nT) | 0.45 | 0.33 (0.56) | 0.26 (0.24, 0.44) |
HSS (200 nT) | 0.62 | 0.71 (0.59) | 0.62 (0.62, 0.46) | ||
HSS (300 nT) | 0.51 | 0.62 (0.50) | 0.51 (0.49, 0.35) | ||
HSS (400 nT) | 0.45 | 0.55 (0.45) | 0.37 (0.39, 0.27) | ||
MAE | 149 | 125 (157) | 146 (159, 188) | ||
d | MAE | 194 | 159 (183) | 173 (191, 236) | |
SA | 0.69 | 0.77 (0.78) | 0.75 (0.76, 0.70) | ||
d | MAE | 133 | 99 (111) | 108 (106, 132) | |
SA | 0.60 | 0.72 (0.75) | 0.72 (0.73, 0.69) | ||
d | Auroral latitudes; ; 48 Stations | HSS (50 nT) | 0.50 | 0.30 (0.52) | 0.20 (0.21, 0.40) |
HSS (200 nT) | 0.54 | 0.60 (0.52) | 0.57 (0.58, 0.36) | ||
HSS (300 nT) | 0.48 | 0.57 (0.44) | 0.45 (0.46, 0.32) | ||
HSS (400 nT) | 0.42 | 0.53 (0.42) | 0.38 (0.36, 0.26) | ||
MAE | 179 | 161 (196) | 171 (194, 247) | ||
d | MAE | 206 | 181 (218) | 198 (222, 279) | |
SA | 0.67 | 0.75 (0.77) | 0.72 (0.74, 0.69) | ||
d | MAE | 139 | 106 (126) | 120 (119, 148) | |
SA | 0.60 | 0.72 (0.72) | 0.73 (0.72, 0.69) | ||
d | All latitudes; ; 206 Stations | HSS (50 nT) | 0.38 | 0.53 (0.64) | 0.50 (0.48, 0.53) |
HSS (200 nT) | 0.61 | 0.74 (0.63) | 0.66 (0.66, 0.50) | ||
HSS (300 nT) | 0.48 | 0.61 (0.51) | 0.49 (0.47, 0.35) | ||
HSS (400 nT) | 0.35 | 0.48 (0.40) | 0.33 (0.34, 0.20) | ||
MAE | 108 | 91 (103) | 102 (112, 133) | ||
d | MAE | 131 | 116 (116) | 126 (139, 149) | |
SA | 0.82 | 0.87 (0.87) | 0.86 (0.86, 0.82) | ||
d | MAE | 81 | 67 (74) | 74 (75, 87) | |
SA | 0.58 | 0.70 (0.79) | 0.67 (0.70, 0.72) |
- Note. Median Heidke skill score (HSS), mean absolute error (MAE) [nT] and sign accuracy (SA) over magnetic latitude (λ) regions for Geospace, GeoDGP, observation persistence (OP) and GeoDGP model persistence (MP). The corresponding lead time for each model is shown below the model name. The model with the best performance is bolded.
With a forecast horizon of , GeoDGP outperforms Geospace in the all-latitude global prediction, achieving lower median MAE values (91, 116, and 67 nT) for the three perturbation components (H, N, and E) along with higher median HSS values (0.53, 0.74, 0.61, and 0.48 for the 50, 200, 300, and 400 nT thresholds, respectively) when predicting d. Interestingly, the HSS values for both models at thresholds of 200 nT and above are higher than their counterparts in Table 5. This is likely due to the extreme intensity of the storm, during which high-level disturbances were frequently recorded even at low- and mid-latitude stations, which is uncommon for moderate storms. This intensity is also evident from the rising MAE values, approximately four times larger than those in Table 5 for all three perturbation components. Despite this increase in disturbance intensity, GeoDGP consistently demonstrates its strong predictive power in forecasting high-level disturbances during extreme geomagnetic storms. This conclusion is further supported by the comparisons with OP, where GeoDGP achieves higher HSS values for all thresholds of 200 nT and above and lower MAE values in predicting d, while having comparable performance in predicting d and d. Regional comparisons also confirm the superiority of GeoDGP over Geospace and OP. We note that, although OP generally performs well at short lead times during minor storms or non-storm periods, it struggles during moderate to extreme geomagnetic storms, such as the Gannon storm. In such cases, it often fails to account for abrupt changes driven by the rapidly evolving magnetic field and solar wind conditions, leading to underestimation or delayed responses. Additionally, GeoDGP predictions with lead time generally achieve higher HSS values in predicting d and lower MAE across all perturbation components compared to Geospace and OP. Although GeoDGP's d HSS is close to that of GeoDGP 1-hr MP, it achieves lower MAE for all components.
We present station-wise predictions of the three perturbation components (H, N and E) with a lead time for the six stations listed in Table 8. Figure 7 first illustrates the predictions of d. All six stations recorded d values exceeding 1,000 nT due to the extreme intensity of the storm. While both the GeoDGP and Geospace models capture the overall trends in the time series, GeoDGP provides predictions closer to observations, while Geospace occasionally underpredicts the peaks. However, during periods when the measurements dropped to low values, GeoDGP occasionally misses the dips and performs worse than Geospace in such instances. These results align with the HSS values reported in Table 9. Figures 8 and 9 show the predictions for d and d, respectively. These components are more challenging to predict due to the difficulty in correctly determining their signs. Table 9 shows that both models tend to miss the sign for d more frequently than for d. Both models also tend to underpredict the peaks of d more than those of d. These findings indicate the risk of underpredicting d if predictions are derived indirectly via Equation 1 rather than by training a model directly on d.
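One way to see why an indirect route through Equation 1 can bias the horizontal magnitude low is sketched below. This is an illustration only, not the authors' argument: it assumes Equation 1 defines the horizontal perturbation as the Euclidean norm of the north and east components, and it uses a hypothetical Gaussian ensemble of plausible component values at a single time step.

```python
import numpy as np

def horizontal_magnitude(d_north, d_east):
    """Assumed form of Equation 1: Euclidean norm of north and east components."""
    return np.hypot(d_north, d_east)

rng = np.random.default_rng(0)

# Hypothetical ensemble of plausible component values at one time step (nT).
d_north = rng.normal(loc=-80.0, scale=120.0, size=10_000)
d_east = rng.normal(loc=30.0, scale=120.0, size=10_000)

# Direct route: average the horizontal magnitude itself.
direct = float(np.mean(horizontal_magnitude(d_north, d_east)))

# Indirect route: average each signed component, then apply Equation 1.
indirect = float(horizontal_magnitude(np.mean(d_north), np.mean(d_east)))
```

Because the norm is a convex function, the indirect value can never exceed the direct one, and the gap grows when the sign of a component is uncertain, since positive and negative values cancel in the component averages.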

Test set 3: 2024-05-10 Gannon extreme storm, observations (black) and predictions using GeoDGP (red) and Geospace (blue) at six stations. Both models have a lead time of . The shaded area represents the 95% prediction intervals of GeoDGP.

Test set 3: 2024-05-10 Gannon extreme storm, observations (black) and predictions using GeoDGP (red) and Geospace (blue) at six stations. Both models have a lead time of . The shaded area represents the 95% prediction intervals of GeoDGP.

Test set 3: the 2024-05-10 Gannon extreme storm, observations (black) and predictions using GeoDGP (red) and Geospace (blue) at six stations. Both models have a lead time of . The shaded area represents the 95% prediction intervals of GeoDGP.
Next, we compare GeoDGP and Geospace based on the global perturbation maps of d during the peak time of the Gannon extreme storm, as indicated by when the SMR index reaches its minimum value of nT at 2024-05-10 22:36:00 UTC. The SMR index, shown in Figure 6, is a composite ring current index that can be viewed as a high spatiotemporal resolution counterpart to the Dst index (Newell & Gjerloev, 2012). It is produced at 1-min cadence instead of Dst's hourly cadence, and uses 98 low- and mid-latitude magnetometers instead of Dst's 4. The global maps at the peak time are shown in Figure 10. Magnetometer stations are denoted by circles, with station observations indicated by the circle fill color. The background contour represents the global model predictions, and the shaded areas represent the night-side of Earth.

Test set 3: 2024-05-10 Gannon extreme storm. Model predictions of (background contour) from GeoDGP (top) and Geospace (bottom) with lead time . The maps correspond to the time of minimum SMR at 2024-05-10 22:36:00 UT. Circles represent magnetometer stations, with fill color indicating their observations. The shaded areas represent the night-side of Earth.
Qualitatively, the predictions from both GeoDGP and Geospace correlate well with station observations worldwide. However, the observations show some localized disturbances that disagree with both models' predictions, such as the small perturbations measured at a geographic latitude of approximately 80° N near northern Norway. We also observe that the night-side shows stronger observed perturbations compared to the day-side, which is fully expected because the auroral electrojets are stronger on the night-side than on the day-side. Also, there is a clear dawn-dusk asymmetry with stronger ground magnetic perturbations seen on the dawn-side, which is likely due to extremely large negative IMF during this time as shown in Figure 6. Since only a few observations are available in the high-latitude region between 45° E and 180° E longitude in the Northern Hemisphere, we include FAC data from the Active Magnetosphere and Planetary Electrodynamics Response Experiment (AMPERE) in Figure 11 for reference, alongside predictions from both models in polar projection. The AMPERE FAC shows a clear dawn-dusk asymmetry with stronger FACs on the dawn-side at this time, and both model predictions capture this asymmetry. Meanwhile, the Geospace model shows rough interhemispheric symmetry in the magnitude of predictions, while GeoDGP displays an asymmetric pattern. Although station observations in the Southern Hemisphere show smaller perturbations than those in the Northern Hemisphere and align more closely with GeoDGP predictions, we note that the Southern Hemisphere has much sparser station coverage compared to the Northern Hemisphere. In particular, there are almost no stations in the southern auroral zone during the peak of this storm. To quantitatively study and compare the performance of the two models in each hemisphere, we reevaluated the model predictions on test set 1 separately for both hemispheres. The results are shown in Table 10.
The comparisons show that GeoDGP achieves higher HSS and lower MAE values in both hemispheres than Geospace. However, comparisons in the Southern Hemisphere may not be comprehensive due to the lack of measurements. Furthermore, the Geospace prediction appears less smooth and has more localized disturbances near both polar regions. Interestingly, some of these features in the Northern Hemisphere agree well with the station observations. For example, the dip in the perturbations observed at stations in North America, distributed around geographic latitudes of approximately 50° N, is well captured by Geospace, whereas the smoother solution from GeoDGP misses it.

Test set 3: 2024-05-10 Gannon extreme storm. AMPERE radial current density observations (top) and model predictions of (background contour) from GeoDGP (middle) and Geospace (bottom) with lead time . The maps correspond to the time of minimum SMR at 2024-05-10 22:36:00 UT. Circles represent magnetometer stations, with fill color indicating their observations. The shaded areas represent the night-side of Earth.
Region | Metric | Geospace | GeoDGP | ||
---|---|---|---|---|---|
North | South | North | South | ||
Low and mid latitudes; ; (52 N; 29 S) | HSS (50 nT) | 0.57 | 0.59 | 0.72 | 0.74 |
HSS (100 nT) | 0.54 | 0.58 | 0.75 | 0.74 | |
HSS (200 nT) | 0.00 | 0.02 | 0.30 | 0.54 | |
MAE | 15 | 15 | 11 | 10 | |
High latitudes; ; (126 N; 15 S) | HSS (50 nT) | 0.44 | 0.31 | 0.57 | 0.55 |
HSS (200 nT) | 0.31 | 0.25 | 0.56 | 0.52 | |
HSS (300 nT) | 0.26 | 0.28 | 0.49 | 0.38 | |
HSS (400 nT) | 0.19 | 0.28 | 0.44 | 0.29 | |
MAE | 76 | 73 | 48 | 39 | |
Auroral latitudes; ; (51 N; 2 S); | HSS (50 nT) | 0.49 | 0.44 | 0.57 | 0.56 |
HSS (200 nT) | 0.42 | 0.30 | 0.59 | 0.48 | |
HSS (300 nT) | 0.34 | 0.27 | 0.55 | 0.35 | |
HSS (400 nT) | 0.27 | 0.22 | 0.53 | 0.28 | |
MAE | 80 | 80 | 68 | 62 | |
All latitudes; ; (178 N; 44 S); | HSS (50 nT) | 0.48 | 0.53 | 0.62 | 0.72 |
HSS (200 nT) | 0.24 | 0.21 | 0.53 | 0.53 | |
HSS (300 nT) | 0.23 | 0.15 | 0.46 | 0.18 | |
HSS (400 nT) | 0.19 | 0.24 | 0.44 | 0.29 | |
MAE | 36 | 16 | 31 | 11 |
- Note. Interhemispheric comparisons of model performance for d prediction based on median HSS and MAE [nT]. Both models feature a forecast horizon of .
4.5 Evaluation of Model Performance Based on Transformer Heating Proxy
Lastly, we use the Gannon extreme storm as an example and discuss the practical use of the GeoDGP model in mitigating the risk of GICs. While GeoDGP achieves better evaluation metrics, its predictions may not be directly usable by ground infrastructure operators for risk management. Domain- and task-specific quantities of interest, such as the impact on transformer heating, must be derived from ground magnetic field perturbations to support decision-making and help prevent another Hydro-Quebec blackout event. The time derivative of d is a widely used and extensively tested proxy quantity in the community (Mac Manus et al., 2017; Pulkkinen et al., 2013; Viljanen et al., 2001). However, it remains challenging to predict due to its highly variable and often chaotic nature (Kellinsalmi et al., 2022; Tóth et al., 2014). Our current model does not reliably predict this time derivative at 1-min temporal resolution. Nevertheless, as we demonstrate in this section, the model prediction remains effective for estimating GIC-related risks. We use the transformer heating effect as an example, and show that it may not be primarily driven by the highest-frequency temporal variation in d.
As an initial attempt to provide a more practical performance measure than the time derivative of d, we propose to use the bi-hourly integral of the geoelectric field as an estimate of the potential transformer heating effect. This measure assumes a simple power-law frequency dependence of the ground conductance and ignores the orientation and other properties of the affected power lines. The methodology is similar to that of Hu et al. (2024). Below, we derive the measure step by step.
The heating of the power grid is generated by the GICs caused by the geoelectric field. The geoelectric field is driven by the d variations and can be calculated in Fourier space. We ignore the effect from the downward component of d. We start by calculating the Fast Fourier Transform of d at magnetometer stations. Figure 12 shows the power spectrum of d based on a 2-hr time interval at station ABK during the Gannon storm. The power spectra of the model prediction and the observations are very similar.

Left: The power spectrum of d in a 2-hr time interval starting from the 24th hr after 2024-05-10 at station ABK. Right: The power spectrum of the geoelectric field proxy in the same time window.

Approximating the impedance matrix as a scalar with a power-law frequency dependence. Data from two magnetotelluric sites near magnetometer station OTT are shown.
Since the characteristic time for transformer heating is about 2 hr, we set the integration window to 2 hr. We propose to use these bi-hourly integrals of the weighted power spectrum of d as a proxy for transformer heating effects, and we evaluate the accuracy of the model against observations on this basis. For the power-law exponent we use 0.5, based on the MT site measurements shown in Figure 13. Figure 12 shows the power spectrum of the geoelectric field proxy at station ABK. The spectral peak of the observation corresponds to a period of about 8 min (a frequency of roughly 0.002 Hz). This suggests that the primary contribution to the transformer heating proxy comes from lower-frequency components of d variations, which are generally more predictable, and that it is not dominated by those of the highest frequency. Further investigations of stations Ottawa (OTT) and Fredericksburg (FRD) support this finding, with the time periods corresponding to the spectral peak typically ranging from 5 to 10 min.
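The construction described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the scalar impedance magnitude is assumed to scale as frequency to the power `alpha`, so the power spectrum of d is weighted by frequency to the power `2 * alpha`; the normalization is arbitrary, so only relative comparisons between model and observation are meaningful; and all function names are hypothetical.

```python
import numpy as np

def weighted_spectrum(db, dt=60.0, alpha=0.5):
    """Return (freq_hz, weighted_power) for one window of a d time series.

    Assumes |Z(f)| ~ f**alpha, so the geoelectric-field proxy power is
    f**(2*alpha) times the power spectrum of d (arbitrary units).
    """
    db = np.asarray(db, dtype=float)
    db = db - db.mean()                       # remove the window mean
    freq = np.fft.rfftfreq(db.size, d=dt)     # frequencies in Hz for cadence dt (s)
    power = np.abs(np.fft.rfft(db)) ** 2      # unnormalized power spectrum
    return freq, freq ** (2.0 * alpha) * power

def heating_proxy(db, dt=60.0, alpha=0.5):
    """Integral over frequency of the weighted power spectrum of one window."""
    freq, weighted = weighted_spectrum(db, dt, alpha)
    return float(np.sum(weighted) * (freq[1] - freq[0]))

def peak_period_minutes(db, dt=60.0, alpha=0.5):
    """Period (in minutes) at which the weighted spectrum peaks."""
    freq, weighted = weighted_spectrum(db, dt, alpha)
    k = int(np.argmax(weighted[1:])) + 1      # skip the zero-frequency bin
    return 1.0 / freq[k] / 60.0

def windowed_proxy(series, window=120, dt=60.0, alpha=0.5):
    """Proxy for consecutive non-overlapping windows (120 samples = 2 hr at 1 min)."""
    series = np.asarray(series, dtype=float)
    n = series.size // window
    return np.array([heating_proxy(series[i * window:(i + 1) * window], dt, alpha)
                     for i in range(n)])
```

Applying `windowed_proxy` to an observed series and to a predicted series at the same station gives two sequences of bi-hourly values that can be compared directly, and `peak_period_minutes` indicates which time scales dominate the proxy in a given window.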
Comparing this quantity between model predictions and observation provides a reasonable estimate of the relative errors in the resulting transformer heating. Figure 14 shows the integral for three stations, ABK at magnetic latitude, OTT at , and FRD at . An advantage of GeoDGP's probabilistic predictions is that the uncertainty, manifested through multiple realizations of the time series, can be propagated to downstream tasks to support decision making. We carry out the uncertainty propagation using Monte Carlo sampling and construct the prediction intervals for the transformer heating proxy. Specifically, for each station, we draw 200 samples from the posterior predictive distribution, use the sample mean to obtain a point estimate, and use the 2.5th and 97.5th sample quantiles as the endpoints to construct the prediction intervals. One artifact we observed in individual realizations, but not in the mean prediction of GeoDGP, is the artificial noise at the 1-min scale, caused by the use of the Matérn-1/2 kernel that places a non-differentiable prior on function values. We apply a boxcar filter of size 3 to the 1-min cadence time series to smooth realizations such that the resulting sample integral proxy is not dominated by high-frequency artifacts. From Figure 14, we find both models to have significant errors in the heating amplitudes compared to the values derived from observations, and similarly the prediction intervals of GeoDGP failed to achieve the nominal coverage rate. On the other hand, both models capture the timing of heating incidents quite well, suggesting their potential practical use for power grid operators. It is very reassuring to see that the large regional differences both in magnitude and timing are well captured by both models.
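The Monte Carlo propagation described above can be illustrated with the following sketch. It assumes the posterior predictive samples are already available as an array of shape (number of samples, number of time steps); how they are drawn from the model is not shown. The point estimate is taken as the mean of the per-sample proxy values, which is one reading of "sample mean". The proxy function is the same assumed frequency-weighted integral, repeated here so that the block is self-contained, and all names are hypothetical.

```python
import numpy as np

def boxcar_smooth(x, size=3):
    """Moving average of width `size`; edges are zero-padded by np.convolve."""
    kernel = np.ones(size) / size
    return np.convolve(x, kernel, mode="same")

def spectral_proxy(x, dt=60.0, alpha=0.5):
    """Assumed heating proxy: integral of f**(2*alpha) times the power spectrum."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    freq = np.fft.rfftfreq(x.size, d=dt)
    weighted = freq ** (2.0 * alpha) * np.abs(np.fft.rfft(x)) ** 2
    return float(np.sum(weighted) * (freq[1] - freq[0]))

def propagate(samples, proxy_fn=spectral_proxy, size=3, level=0.95):
    """Smooth each realization, evaluate the proxy, and summarize the spread.

    Returns (point_estimate, lower, upper), where lower/upper are the
    2.5th and 97.5th sample quantiles for level=0.95.
    """
    samples = np.asarray(samples, dtype=float)
    values = np.array([proxy_fn(boxcar_smooth(s, size)) for s in samples])
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(values, [tail, 1.0 - tail])
    return float(values.mean()), float(lower), float(upper)

def empirical_coverage(obs, lower, upper):
    """Fraction of observed values that fall inside their prediction intervals."""
    obs, lower, upper = (np.asarray(v, dtype=float) for v in (obs, lower, upper))
    return float(np.mean((obs >= lower) & (obs <= upper)))
```

With 200 realizations of a 2-hr window, `propagate(samples)` returns one point estimate and one interval for that window. Repeating this over consecutive windows and passing the observation-derived proxy values to `empirical_coverage` gives the coverage rate that can be compared with the nominal 95%.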

Using fast Fourier transform to evaluate the potential impact of d variations on heating transformers during the Gannon extreme storm. A comparison of the proxy quantity derived from the GeoDGP prediction (red), the Geospace prediction (blue), and observations from three magnetometer stations (black) is shown. The shaded area represents the 95% prediction interval of GeoDGP. Both GeoDGP and Geospace have a prediction lead time of .
5 Conclusions
In this study, we developed GeoDGP, a global, grid-free probabilistic forecasting model for predicting the north (d), east (d), and horizontal (d) components of ground magnetic field perturbations at 1-min time cadence. The model provides predictions with lead times of and . To evaluate its performance, we conducted statistical analysis using 24 geomagnetic storms that span a wide range of space weather conditions, from moderate to extreme. The evaluation also included observations from over 200 magnetometer stations across the globe.
Evaluation on the 22 geomagnetic storms in 2015 showed that GeoDGP consistently outperforms the first-principles Geospace model under both lead times. Notably, GeoDGP with a lead time, which is the same as Geospace, achieved median HSS values of 0.63, 0.53, 0.44, and 0.41 for the 50, 200, 300, and 400 nT thresholds in predicting d, a median SA of 0.82 in predicting d and 0.67 in predicting d, and median MAE values of 23, 21, and 18 nT in predicting d, d and d, respectively. The model also captures spatial heterogeneity in predictive uncertainty, accounting for regional differences in disturbance levels. A separate evaluation on two selected storms, aimed at comparing with the DAGGER model, showed that GeoDGP not only extended the forecast horizon from DAGGER's to but also maintained significantly better storm-period HSS and MAE. The detailed evaluation under the 2024-05-10 Gannon extreme storm demonstrated the consistent performance of GeoDGP under extreme space weather conditions. In all evaluation cases, GeoDGP with both lead times achieved performance better than or similar to that of OP in predicting , highlighted by high HSS values in predicting high-level disturbances.
This paper introduced several novel elements and key contributions. (a) We encoded the location of magnetometer stations in the SM coordinate system, which naturally incorporated diurnal variations. This allowed us to jointly model the data observed from different stations at different times and improved the observational coverage of the surface of Earth. (b) We modeled the non-stationarity of storm period data using a combination of regular and spectral kernels in the hidden layers of the DGP model, achieving significant improvement in predicting high-level disturbances compared to existing methods. (c) We provided predictive uncertainty along with forecasts, leveraging the probabilistic capabilities of the DGP model. (d) We extended GeoDGP's forecast horizon to 1 hr plus the L1 propagation time, and still achieved better performance than existing models. (e) GeoDGP is computationally efficient to evaluate. We note that for the full grid with 180 (latitude) × 360 (longitude) resolution, the model takes an average evaluation time of 4 s on an NVIDIA RTX 4090 GPU (24 GB VRAM) in a system with an Intel Core i9-13900K CPU and 64 GB RAM. This suggests that GeoDGP is suitable for operational deployment and real-time space weather applications. A website (https://csem.engin.umich.edu/GeoDGP) has been developed to provide real-time predictions of ground magnetic field perturbations.
Despite its strong performance, there remain limitations to the GeoDGP model that warrant future work. First, predictions in the auroral zone remain challenging. The model tends to underpredict peaks in d and overpredict troughs in d, while the empirical coverage of the three components is significantly lower than the nominal rate of 95%. Second, since we only encode location in the SM coordinate system, the model does not account for the localized effects of ground conductivity. This exclusion overlooks internal sources of magnetic perturbations arising from currents induced in the solid Earth (Tanskanen et al., 2001). As a result, the model may fail to capture localized enhancements that are operationally significant and have been emphasized in the TPL-007 standard via the supplemental waveform and in reports by the Electric Power Research Institute. While the model supports flexible spatial querying and implies high nominal spatial resolution, its effective resolution is limited by the smoothness of the predictions, especially in comparison to the physics-based Geospace model, which resolves finer mesoscale structures. Nonetheless, the smoothness of the GeoDGP predictions serves as a strength in capturing broad spatial trends, making the model well-suited for forecasting regional-scale ground magnetic field perturbations while avoiding overfitting to localized variability. One possible mitigation for the model's resolution limitations is to indirectly encode ground conductivity by including the geographic coordinates of observation sites as additional input. However, extrapolation remains challenging due to the current sparse and uneven observational coverage. A more direct alternative would involve coupling the model with a global ground conductivity model by incorporating features derived from it.
Therefore, a thorough investigation into the interplay between model hyperparameters, local geological encoding, and the resulting effective spatial resolution would be a valuable direction for future work. Third, the model performance in the Southern Hemisphere has not been fully evaluated due to the sparsity of measurements. Future efforts aimed at enhancing model accuracy and expanding data sets that target these difficult regions would be highly beneficial. Additionally, tailoring the model to better align with operational practice, such as coupling with models of subsurface conductivity and power network (e.g., Mac Manus et al., 2022), refining the estimation of downstream quantities of interest, and propagating uncertainty in model predictions to support risk-based decision making, would be another valuable direction of future work.
Acknowledgments
This work is supported by the National Science Foundation under Grant PHY-2027555: “SWQU: NextGen SWMF Using Data, Physics and Uncertainty Quantification.” We acknowledge use of the OMNIWeb and CDAWeb services of NASA/GSFC's Space Physics Data Facility (SPDF) (NASA Space Physics Data Facility (SPDF), 2024), and OMNI data (King & Papitashvili, 2020; Papitashvili & King, 2020). We gratefully acknowledge the SuperMAG collaborators (Gjerloev, 2012; Newell & Gjerloev, 2012). The Dst data are provided by the WDC for Geomagnetism, Kyoto and are available through Nose et al. (2015). The magnetotelluric (MT) data (Kelbert, 2020; Kelbert et al., 2019) are publicly available through Kelbert et al. (2011). The model is implemented using the Python package GPyTorch (Gardner et al., 2018). The SWMF source code is publicly available through Gombosi et al. (2021).
Open Research
Data Availability Statement
The scripts and routines used to produce the results in this manuscript are available in the University of Michigan Library Deep Blue Data Repository at Chen et al. (2024).