Volume 23, Issue 6 e2024SW004301
Research Article
Open Access

GeoDGP: One-Hour Ahead Global Probabilistic Geomagnetic Perturbation Forecasting Using Deep Gaussian Process

Corresponding Author

Hongfan Chen

Department of Mechanical Engineering, University of Michigan, Ann Arbor, MI, USA

Correspondence to:

H. Chen,

[email protected]

Contribution: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing - original draft, Writing - review & editing, Visualization

Gabor Toth

Department of Climate and Space Sciences and Engineering, University of Michigan, Ann Arbor, MI, USA

Contribution: Conceptualization, Methodology, Validation, Formal analysis, Investigation, Resources, Data curation, Writing - review & editing, Visualization, Supervision, Project administration, Funding acquisition

Yang Chen

Department of Statistics, University of Michigan, Ann Arbor, MI, USA

Contribution: Conceptualization, Methodology, Formal analysis, Writing - review & editing, Supervision, Funding acquisition

Shasha Zou

Department of Climate and Space Sciences and Engineering, University of Michigan, Ann Arbor, MI, USA

Contribution: Conceptualization, Formal analysis, Writing - review & editing, Visualization, Supervision, Funding acquisition

Zhenguang Huang

Department of Climate and Space Sciences and Engineering, University of Michigan, Ann Arbor, MI, USA

Contribution: Methodology, Formal analysis, Writing - original draft

Xun Huan

Department of Mechanical Engineering, University of Michigan, Ann Arbor, MI, USA

Contribution: Conceptualization, Methodology, Writing - review & editing, Supervision, Funding acquisition

First published: 07 June 2025

Abstract

Accurately predicting the horizontal component of the ground magnetic field perturbation ($\mathrm{d}{B}_{H}$), a key quantity for calculating geomagnetically induced currents (GICs), is crucial for assessing the space weather impact of geomagnetic disturbances. The current operational first-principles Michigan Geospace model provides effective forecasts of $\mathrm{d}{B}_{H}$, but requires significant computational resources to achieve real-time speeds. Existing data-driven methods tend to underpredict $\mathrm{d}{B}_{H}$ and lack uncertainty quantification, which is either overlooked or treated as secondary. In this work, we introduce GeoDGP, a novel and efficient data-driven model based on the deep Gaussian process. GeoDGP provides global probabilistic forecasts of $\mathrm{d}{B}_{H}$ with a lead time of at least 1 hr, at 1-min time cadence, and at arbitrary spatial locations. The model takes solar wind measurements, the Dst index, and the prediction location in the solar magnetic coordinate system as inputs, and is trained on 28 years of data from SuperMAG global magnetometer stations. GeoDGP is also trained to predict the north ($\mathrm{d}{B}_{N}$) and east ($\mathrm{d}{B}_{E}$) components of the perturbations. We evaluate GeoDGP's performance at over 200 stations worldwide during 24 geomagnetic storms, including the Gannon extreme storm of May 2024. Comparisons with the first-principles Michigan Geospace model and the data-driven DAGGER model show that GeoDGP significantly outperforms both across multiple performance metrics.

Key Points

  • GeoDGP is a data-driven model that provides global probabilistic geomagnetic perturbation forecast at 1 min cadence and 1 hr ahead

  • We evaluate GeoDGP on a wide range of geomagnetic storms and at over 200 magnetometer stations across the globe

  • GeoDGP outperforms both the state-of-the-art first-principles Geospace model and the data-driven DAGGER model across multiple metrics

Plain Language Summary

Ground magnetic field perturbations are crucial for predicting geomagnetically induced currents, which can harm power grids, communication systems, and other ground infrastructure. Accurately predicting these perturbations with high temporal and spatial resolution is critical but remains a significant challenge in space weather forecasting. In this work, we introduce GeoDGP, an advanced data-driven model that provides reliable forecasts of these perturbations at least 1 hr ahead, at 1 min intervals, and for any location. GeoDGP's performance is evaluated across a wide range of geomagnetic storms using data from over 200 magnetometer stations worldwide. The results show that GeoDGP significantly outperforms leading existing first-principles and data-driven models.

1 Introduction

Geomagnetically induced currents (GICs), driven by geomagnetic storms, can significantly impact modern technological systems such as power grids, natural gas pipelines, and telecommunication systems (Boteler, 2001; Eastwood et al., 2018; Pirjola et al., 2000; Pulkkinen et al., 2001). Predicting GICs directly for the full surface of Earth is challenging, primarily due to the limited availability of GIC data. This limitation arises from the proprietary nature of data sets and the spatial-temporal sparsity of measurements. As GICs are driven by the geoelectric field at Earth's surface, they can also be obtained from the horizontal component of ground magnetic field perturbations, $\mathrm{d}{B}_{H}$, and the ground conductivity (Huang et al., 2004).

Currently, the NOAA Space Weather Prediction Center uses the Michigan Geospace model (referred to as the Geospace model hereafter), a component of the Space Weather Modeling Framework (SWMF; Gombosi et al., 2021; Tóth et al., 2012), to produce ground magnetic field perturbation forecast maps. The first-principles Geospace model solves partial differential equations on various grids, providing high-fidelity and physics-justified predictions, but it is generally computationally expensive. Its storm-time prediction performance for $\mathrm{d}{B}_{H}$ was comprehensively studied by Al Shidi et al. (2022), who simulated 122 geomagnetic storms from 2010 to 2019 and evaluated the results at over 300 magnetometer stations around the world. While the model achieved a median Heidke Skill Score (HSS) of 0.45 using a threshold of 50 nT for magnetometers in all latitude regions, the prediction of high-latitude regional disturbances remained challenging.

Several data-driven models have also been developed to predict $\mathrm{d}{B}_{H}$ and/or, more challengingly, the horizontal component of the time derivative of ground magnetic field perturbations, ${(\mathrm{d}\mathbf{B}/\mathrm{d}t)}_{H}$ (Viljanen et al., 2001), following the Geospace Environment Modeling (GEM) challenge that ran from 2008 to 2012 (Pulkkinen et al., 2013). These models are trained on available historic storm data to directly learn the mapping from solar wind and geophysical features to the quantity of interest, bypassing the complexities of explicitly modeling the governing physics. A key advantage of data-driven models is that they are computationally inexpensive to evaluate once the training is finished. As an example, Pinto et al. (2022) experimented with a feed-forward artificial neural network (ANN), a long short-term memory (LSTM) recurrent neural network, and a convolutional neural network (CNN) and compared the performance of these models in predicting ${(\mathrm{d}\mathbf{B}/\mathrm{d}t)}_{H}$ across six ground magnetometer stations studied in the GEM challenge. Coughlan et al. (2023) explored categorical forecasts to mitigate the heavy-tailed distribution of ${(\mathrm{d}\mathbf{B}/\mathrm{d}t)}_{H}$, using a CNN to predict whether it would exceed its 99th percentile threshold 30–60 min in the future. These studies have shown that directly predicting ${(\mathrm{d}\mathbf{B}/\mathrm{d}t)}_{H}$ from solar wind features is rather challenging due to its highly variable and chaotic nature. As a result, model performance is often evaluated using threshold-based metrics such as skill scores (Pulkkinen et al., 2013).

On the other hand, $\mathrm{d}{B}_{H}$ is more predictable than ${(\mathrm{d}\mathbf{B}/\mathrm{d}t)}_{H}$ (Kellinsalmi et al., 2022; Tóth et al., 2014), and it shows a strong correlation with ${(\mathrm{d}\mathbf{B}/\mathrm{d}t)}_{H}$ that can be approximated by a power law (Tóth et al., 2014). Keesee et al. (2020) and Blandin et al. (2022) used ANN and/or LSTM models to indirectly predict ${(\mathrm{d}\mathbf{B}/\mathrm{d}t)}_{H}$ (or $\left\vert \mathrm{d}{B}_{N}\right\vert /\mathrm{d}t$, the time derivative of the magnitude of the northward component) by training models to predict $\mathrm{d}{B}_{H}$ (or $\left\vert \mathrm{d}{B}_{N}\right\vert $). However, these approaches did not consistently outperform benchmark models in predicting ${(\mathrm{d}\mathbf{B}/\mathrm{d}t)}_{H}$. In this work, we show that certain GIC-related risks, such as transformer heating, can be quantified directly from $\mathrm{d}{B}_{H}$ without relying on ${(\mathrm{d}\mathbf{B}/\mathrm{d}t)}_{H}$, in a manner similar to Hu et al. (2024). We also show that significant improvement in $\mathrm{d}{B}_{H}$ prediction can be achieved. Prior work by Iong et al. (2024) developed a Gaussian process (GP) model incorporating contaminated Gaussian noise (Gleason, 1993) to predict the maximum $\mathrm{d}{B}_{H}$ over 20-min temporal bins with uncertainty quantification (UQ) at selected stations. These approaches mainly focus on a single-station modeling strategy and do not consider or exploit the spatial-temporal correlations between different stations. Upendran et al. (2022) expanded the scope to global prediction, proposing the DAGGER model that couples spherical harmonics with deep learning to generate $\mathrm{d}{B}_{H}$ predictions for mid- and high-latitude regions in the entire Northern Hemisphere. While these models predict the timing of magnetic perturbations well, they tend to consistently underestimate $\mathrm{d}{B}_{H}$ during intense geomagnetic storms (Camporeale et al., 2020), where $\mathrm{d}{B}_{H}$ can still be highly variable, with peak values exceeding 1,000 nT. Furthermore, neither the Geospace model nor the DAGGER model systematically integrates predictive uncertainty estimation for global forecasting, which would be valuable for understanding model reliability and supporting operational decision-making.

We develop a new data-driven model for global $\mathrm{d}{B}_{H}$ predictions, which we call GeoDGP, based on the deep Gaussian process (DGP) (Damianou & Lawrence, 2013), a Bayesian modeling approach. Specifically, GeoDGP provides global probabilistic forecasts of $\mathrm{d}{B}_{H}$ at 1-min time cadence and at arbitrary spatial locations. The model takes as input the queried location as well as storm features, which include solar wind and interplanetary magnetic field (IMF) measurements, solar radio flux measurements (the ${F}_{10.7}$ index), and the disturbance storm time (Dst) index. A distinctive and important idea of GeoDGP is to encode the location of stations relative to the Sun as an input feature, using the solar magnetic (SM) coordinate system. This idea leverages our physical understanding that magnetic perturbation primarily depends on the location in SM coordinates instead of the geographic location (Kivelson & Russell, 1995). As a result, even with the same data set, magnetometer stations effectively provide information over fixed magnetic latitude circles but all magnetic local times as the Earth rotates, instead of single geographical points. This improves coverage over the surface of Earth and enhances spatial interpolation to different locations, potentially improving the model's forecasting capabilities, though it may neglect local ground conductivity and geological effects.

We train GeoDGP to operate with two different lead times. The first lead time is the time it takes for the solar wind and IMF observed near the first Lagrange point (L1) to reach Earth (referred to as ${T}_{S}$ hereafter), which varies from about 30 min at high speeds of 800 km/s to around 1 hr at typical speeds of 400 km/s. The second lead time is ${T}_{S}$ plus 1 hr. While this longer lead time poses greater challenges due to the difficulty in predicting future solar wind and IMF, it significantly enhances the model's utility for practical applications, offering additional warning time. Moreover, nightside ionospheric current systems, including field-aligned currents (FACs), which are strongly correlated with ground magnetic field perturbations, can respond to solar wind driving with a time lag on the order of an hour (Coxon et al., 2019).
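As a rough check on the quoted range of ${T}_{S}$, the transit time can be estimated by dividing the distance from L1 to Earth by the solar wind speed. The sketch below assumes a nominal L1 distance of about 1.5 million km and purely ballistic propagation at constant speed; the function name and the constant are illustrative and are not taken from the GeoDGP code.

```python
# Nominal Earth-L1 distance in km (assumption: about 1.5 million km).
L1_DISTANCE_KM = 1.5e6


def transit_time_minutes(speed_km_s: float, distance_km: float = L1_DISTANCE_KM) -> float:
    """Ballistic travel time from L1 to Earth, in minutes, at constant speed."""
    if speed_km_s <= 0:
        raise ValueError("solar wind speed must be positive")
    return distance_km / speed_km_s / 60.0


if __name__ == "__main__":
    for speed in (800.0, 400.0):
        print(f"{speed:.0f} km/s -> {transit_time_minutes(speed):.2f} min")
```

Under these assumptions the estimate is roughly 31 min at 800 km/s and 62.5 min at 400 km/s, in line with the range stated above.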

GeoDGP is evaluated on three test data sets consisting of over 200 magnetometer stations and a total of 24 geomagnetic storms, including the recent Gannon extreme storm in May 2024. Results show that GeoDGP predicts perturbations with magnitudes comparable to observations, even during the extreme storm. Furthermore, it consistently outperforms both the Geospace and DAGGER models across multiple metrics, while also capturing spatial heterogeneity in its predictive uncertainty estimation. Notably, the same model architecture can be used to predict the northward component $\mathrm{d}{B}_{N}$ and the eastward component $\mathrm{d}{B}_{E}$, which are important for calculating the geoelectric field and the resulting GICs. However, while the model supports predictions at arbitrary spatial locations, its effective spatial resolution is constrained by the smoothness imposed by the learned functions.

The remainder of this paper is organized as follows. Section 2 describes the data used in this study and the motivation for global modeling. Section 3 provides details of the GeoDGP model and the model architecture. Section 4 contains key results on model performance and further discussions. Section 5 presents conclusions, limitations, and ideas for future work.

2 Data Cleaning and Preprocessing

In this section, first we show the correlation and underlying clustering structure of magnetometer stations to motivate the global modeling. Then, we introduce the problem setup and describe the model input and output. Finally, we introduce ground magnetic field perturbations data used in this paper and the data cleaning procedure.

2.1 Station Correlation and Clustering

Throughout the paper, the horizontal component of ground magnetic field perturbations—the key quantity of interest in this study—is defined as
$\mathrm{d}{B}_{H}=\sqrt{\mathrm{d}{B}_{N}^{2}+\mathrm{d}{B}_{E}^{2}},$ (1)
where $\mathrm{d}{B}_{N}$ and $\mathrm{d}{B}_{E}$ are the northward and eastward components of ground magnetic field perturbations, respectively. A primary motivation for global modeling stems from the correlations and underlying clustering structures among ground magnetometer stations. To explore these relationships, we perform an initial analysis by estimating the Pearson correlations between stations using $\mathrm{d}{B}_{H}$ observations from the Gannon extreme storm. The analysis includes a total of $S=206$ available stations. The resulting correlation matrix $\left({r}_{ij}\right)$ is transformed to a distance matrix with entries
${d}_{ij}=\sqrt{2\left(1-{r}_{ij}\right)},\qquad i,j=1,\ldots ,S.$ (2)

The matrix $\left({d}_{ij}\right)$ is then used as input for clustering analysis to group stations based on their similarities. In particular, we use hierarchical agglomerative clustering with complete linkage (Hastie et al., 2009), which progressively merges stations into clusters based on their proximity until a pre-specified distance threshold is met. Using a distance threshold of 1.15, we identify 8 disjoint clusters, as shown in Figure 1, where circles represent stations and their colors indicate cluster assignment. We observe that station clusters mostly align with the magnetic latitude, and correlations are noted between stations in the Northern and Southern Hemispheres that share similar magnetic latitudes (in absolute value). Additionally, the denser distribution of stations in the Northern Hemisphere allows for an observation of longitudinal structure in the clustering results. These structures may reflect underlying magnetospheric processes that are dependent on magnetic local time (MLT), such as the distribution of FACs and auroral electrojets. Moreover, longitudinal variations in ionospheric conductivity, which can be driven by solar illumination, may also affect the ground magnetic perturbations and contribute to the observed spatial patterns. These results suggest that modeling ground magnetic perturbations jointly using data from all stations can be more beneficial than a single-station training strategy.


Figure 1. Hierarchical clustering of magnetometer stations based on the correlation of $\mathrm{d}{B}_{H}$ observations during the Gannon extreme storm. The distribution of station clusters is well aligned with the magnetic latitude grid lines (black and solid).
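The clustering procedure of this subsection can be sketched in a few lines: compute pairwise Pearson correlations, convert them to distances with Equation 2, and merge clusters under complete linkage until the smallest inter-cluster distance exceeds the threshold. The following is a minimal pure-Python illustration on toy series, not the code used for Figure 1; all function names are hypothetical, and a naive search over cluster pairs stands in for an optimized library routine.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def distance_matrix(series):
    """Distances d_ij = sqrt(2 (1 - r_ij)) between all pairs of series."""
    s = len(series)
    dist = [[0.0] * s for _ in range(s)]
    for i in range(s):
        for j in range(i + 1, s):
            r = pearson(series[i], series[j])
            # Clamp at zero to guard against round-off pushing r above 1.
            dist[i][j] = dist[j][i] = math.sqrt(max(0.0, 2.0 * (1.0 - r)))
    return dist


def complete_linkage_clusters(dist, threshold):
    """Agglomerative clustering; stop once the closest pair exceeds threshold."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the farthest members.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return sorted(clusters)


if __name__ == "__main__":
    # Toy "stations": two strongly correlated series and one anti-correlated.
    toy = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 11], [5, 4, 3, 2, 1]]
    print(complete_linkage_clusters(distance_matrix(toy), threshold=1.15))
```

Because the distance in Equation 2 ranges from 0 (perfect correlation) to 2 (perfect anti-correlation), a threshold of 1.15 keeps only stations whose pairwise correlation is moderately positive within the same cluster.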

2.2 Problem Setup

We frame the prediction of ground magnetic field perturbations as a regression task, that is, a supervised learning problem where a model is trained on labeled data to learn the relationship between input features and real-valued outputs. In this work, we train separate models for two lead times: ${T}_{S}$ and ${T}_{S}+1\,\text{hr}$, where ${T}_{S}$ represents the time it takes for the solar wind and IMF observed at L1 to reach Earth. In each case, the model takes input at time $t$ and outputs the ground magnetic perturbations at $t+{T}_{S}$ or $t+{T}_{S}+1\,\text{hr}$. The output is produced at 1-min temporal resolution and can be made at arbitrary spatial locations. We train separate models for predicting $\mathrm{d}{B}_{N}$, $\mathrm{d}{B}_{E}$, and $\mathrm{d}{B}_{H}$.

2.3 Model Input

We group the model input features of GeoDGP into two sets: storm feature input and location input.

The storm feature input captures the drivers of geomagnetic storms. This input set includes the real-time Dst and ${F}_{10.7}$ indexes, along with solar wind and IMF measurements. The Dst index can be obtained from the Kyoto Dst index service, and the other measurements can be sourced from the NASA/GSFC's OMNI data set (Papitashvili & King, 2020). Specifically, we include the solar wind velocity ${V}_{x}$ in Geocentric Solar Magnetospheric (GSM) coordinates, the proton number density ${N}_{p}$, the plasma temperature $T$, the IMF components $\left({B}_{x},{B}_{y},{B}_{z}\right)$ in GSM coordinates, and the solar radio flux at 10.7 cm (${F}_{10.7}$ index). Note that the OMNI data are already time-shifted to the magnetosphere's bow shock nose (King & Papitashvili, 2005) from their raw measurement at L1 to better align with ground measurements on Earth. Additionally, we augment these features with their history before feeding them to the model. The augmentation is done separately for each measurement due to their different time cadences. Based on empirical experiments, we augment the Dst index with 12 hr of history at hourly intervals, use ${F}_{10.7}$ as is without any history due to its daily cadence, and augment solar wind and IMF measurements with 1 hr of history using 5-min medians based on their 1-min cadence. This input augmentation helps account for uncertainty in the time lag between upstream measurements and ground magnetic responses, while the use of 5-min medians for temporal smoothing reduces input noise and keeps the dimensionality manageable for downstream model training. For all measurements, missing values are imputed based on the last available values if the gap is within 15 min; otherwise, we remove the gap interval from the data set. This 15-min threshold is chosen based on the trade-off between maximizing data availability and maintaining data validity (Smith et al., 2022).
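The gap handling and history augmentation for the 1-min solar wind and IMF inputs can be illustrated as follows. This is a minimal sketch under stated assumptions: missing samples are represented by `None`, "within 15 min" is read as a gap of at most 15 consecutive samples, and the 1-hr history is taken as the 60 samples ending at the input time. The function names are hypothetical and do not come from the GeoDGP code.

```python
from statistics import median


def fill_short_gaps(values, max_gap=15):
    """Forward-fill runs of None no longer than max_gap samples.

    Longer runs (and a leading run with no previous value) are left as None
    so that the caller can drop those intervals from the data set.
    """
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is None:
            j = i
            while j < n and filled[j] is None:
                j += 1
            if i > 0 and (j - i) <= max_gap:
                for k in range(i, j):
                    filled[k] = filled[i - 1]
            i = j
        else:
            i += 1
    return filled


def history_medians(values, t, history=60, block=5):
    """Medians over consecutive block-sample windows in the history ending at t.

    Returns None if any sample in the history window is still missing.
    """
    start = t - history + 1
    if start < 0:
        raise ValueError("not enough history before index t")
    window = values[start:t + 1]
    if any(v is None for v in window):
        return None
    return [median(window[k:k + block]) for k in range(0, history, block)]


if __name__ == "__main__":
    series = fill_short_gaps([float(v) if v % 7 else None for v in range(1, 121)])
    print(history_medians(series, t=119))
```

With the default settings each 1-min quantity contributes 12 values (one per 5-min block) instead of 60, which is the dimensionality reduction mentioned above.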

The location input contains the dipole tilt angle $\theta $ (the time-dependent angle between the Earth dipole and the Sun-Earth line) and the time-dependent magnetometer station locations (which reflect the positions of magnetometer stations relative to the magnetic dipole of Earth and the direction of the Sun). The inclusion of the dipole tilt angle and the IMF components in GSM coordinates allows the model to capture seasonal effects, such as the Russell–McPherron effect (Russell & McPherron, 1973), provided sufficient training data. The station locations are given as the magnetic longitude $\phi $ (similar to the MLT) and magnetic latitude $\lambda $ of each station in the SM coordinate system. The SM coordinates rotate with both a yearly and daily periodicity with respect to inertial coordinates. The $Y$-axis is chosen perpendicular to the Earth-Sun line pointing towards dusk, the $Z$-axis is parallel to the magnetic axis pointing towards the geographic north, and the $X$-axis completes the right-handed coordinate system such that the Sun is in the half plane determined by the ${+}X$ and $\pm Z$ axes (Russell, 1971). The 0-longitude in SM coordinates contains the subsolar point, and the longitude increases counterclockwise when viewed from the north magnetic pole. Therefore, this setup naturally incorporates periodicity and seasonality into the model input. Additionally, rather than being a fixed geographical point, a station provides information along a curve of fixed magnetic latitude that covers all magnetic local times due to the rotation of Earth. This allows data samples to share the same location input even when observed from stations at different geographic locations at different times. As a result, this approach is conceptually different from training a separate model for each individual station. One challenge with such a modeling strategy is that discontinuities in longitude can lead to discontinuities in model predictions. To address this, we use trigonometric functions to transform the longitude $\phi $ before feeding it to the model to ensure continuity.
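The effect of the trigonometric transform can be seen with a small example: two longitudes on either side of the 0°/360° wrap are far apart as raw numbers but adjacent once mapped to $(\cos \phi ,\sin \phi )$. The sketch below is illustrative only; the function names are hypothetical and the input is assumed to be in degrees.

```python
import math


def encode_longitude(phi_deg):
    """Map a longitude in degrees to the pair (cos(phi), sin(phi))."""
    phi = math.radians(phi_deg)
    return math.cos(phi), math.sin(phi)


def feature_distance(phi_a_deg, phi_b_deg):
    """Euclidean distance between two encoded longitudes."""
    cos_a, sin_a = encode_longitude(phi_a_deg)
    cos_b, sin_b = encode_longitude(phi_b_deg)
    return math.hypot(cos_a - cos_b, sin_a - sin_b)


if __name__ == "__main__":
    # Raw values differ by 359.8, yet the encoded features nearly coincide.
    print(feature_distance(359.9, 0.1))
    # Opposite longitudes are maximally separated in the encoded space.
    print(feature_distance(0.0, 180.0))
```

The encoding costs one extra input dimension but removes the artificial jump at the wrap, so a model that is smooth in its inputs is also smooth around the full longitude circle.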

The final model input set to GeoDGP is summarized in Table 1.

Table 1. GeoDGP Model Input Features
Symbol | Description
${V}_{x}$ | x-component of solar wind velocity in GSM coordinates
${N}_{p}$ | Proton number density
$T$ | Plasma temperature
${B}_{x}$, ${B}_{y}$, ${B}_{z}$ | IMF in GSM coordinates
Dst | Disturbance storm time index
${F}_{10.7}$ | Solar radio flux at 10.7 cm
$\theta $ | Dipole tilt angle
$\lambda $ | Geomagnetic latitude
$\cos (\phi )$ | Cosine of the geomagnetic longitude in SM coordinates
$\sin (\phi )$ | Sine of the geomagnetic longitude in SM coordinates

During operational forecasting, only the storm feature input needs to be provided in real time, as time-dependent location input can be precomputed. Leveraging the statistical properties of DGP (Damianou & Lawrence, 2013), GeoDGP allows predictions to be made at any location on the surface of Earth, without being restricted to the fixed locations of magnetometer stations.

2.4 Model Output

The model output variables of GeoDGP are the ground magnetic perturbation components ($\mathrm{d}{B}_{N}$, $\mathrm{d}{B}_{E}$, $\mathrm{d}{B}_{H}$). The measurement data of ground magnetic perturbations can be downloaded from the SuperMAG website (Gjerloev, 2012). Raw measurements collected from magnetometer stations around the globe are preprocessed by SuperMAG with a common baseline removal approach that subtracts the daily variations and yearly trend, and are transformed to the same coordinate system and an identical resolution (Gjerloev, 2012). Ground magnetic perturbations are highly variable, making model training at full temporal resolution challenging. To address this, a common preprocessing step in the literature, as seen, for example, in Pulkkinen et al. (2013) and Iong et al. (2024), is to summarize $\mathrm{d}{B}_{H}$ or ${(\mathrm{d}\mathbf{B}/\mathrm{d}t)}_{H}$ by its maximum over 20-min temporal bins. In this work, we preprocess the perturbation data by summarizing each component ($\mathrm{d}{B}_{N}$, $\mathrm{d}{B}_{E}$, $\mathrm{d}{B}_{H}$) by the value with the largest magnitude within 10-min temporal bins. The model is then trained on these summarized values. During testing, the model predictions serve as proxy estimates for the 1-min temporal resolution ground magnetic perturbations, and are evaluated with respect to the true 1-min values. In other words, the model takes input at time $t$ and predicts the maximum perturbation (by magnitude) over the window $\left[t+T,\,t+T+10\,\text{min}\right)$, which is then used as an estimate for the perturbation at time $t+T$, with $T$ being the model lead time. The use of summary values provides effective temporal smoothing while preserving perturbation peaks, thereby reducing variability in model parameters across different training data sets. We experimented with different sizes of temporal bins and found empirically that the 10-min bin yields the highest model likelihood on the validation set, which will be defined in the next section.
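The target construction described above can be written compactly. The following sketch assumes a gap-free 1-min series indexed by minute and a lead time expressed in whole minutes; the function names are hypothetical. For signed components the sign of the extreme value is kept, and for the non-negative $\mathrm{d}{B}_{H}$ the operation reduces to an ordinary maximum.

```python
def bin_extreme(series, start, width=10):
    """Value with the largest magnitude in series[start:start + width]."""
    window = series[start:start + width]
    if len(window) < width:
        raise ValueError("incomplete bin")
    # max(..., key=abs) keeps the sign of the extreme value.
    return max(window, key=abs)


def make_targets(series, lead, width=10):
    """Training target for each input time t: extreme over [t+lead, t+lead+width)."""
    last_t = len(series) - lead - width
    return [bin_extreme(series, t + lead, width) for t in range(last_t + 1)]


if __name__ == "__main__":
    db_n = [1, -7, 3, 2, 0, 5, -1, 4, 6, -2, 9, 1]  # toy 1-min values in nT
    print(bin_extreme(db_n, start=0))
    print(make_targets(db_n, lead=1))
```

Because consecutive input times use overlapping windows, neighboring targets often share the same peak, which is the temporal smoothing effect noted in the text.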

2.5 Data for Model Training, Validation, and Testing

In this study, we use SuperMAG data from all available stations (approximately 350) spanning 28 years (from 1995 to 2022) to develop the GeoDGP model. The data are further subject to the availability of the OMNI data set, which provides the model input. We carry out our data-driven modeling exclusively on data from storm periods. Specifically, a storm period is identified by first finding the storm peak, defined as the time when Dst reaches its local minimum below $-50$ nT. A 120-hr interval is then formed by extending $\pm 60$ hr around this peak. This approach balances the amount of data from quiet periods with that from storm periods.
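A simple version of this storm selection rule is sketched below for an hourly Dst series. The text does not specify how a local minimum is detected, so the sketch assumes a sample counts as a peak when it lies below the threshold, is lower than the previous sample, and is not higher than the next one; the function name is hypothetical. Real Dst records fluctuate during a storm, so a practical implementation would likely need additional logic (for example, merging nearby peaks) that is not shown here.

```python
def find_storm_windows(dst_hourly, threshold=-50.0, half_width=60):
    """Return (start, peak, end) index triples for storm windows.

    Indices refer to positions in the hourly series; windows are clipped
    at the ends of the record.
    """
    windows = []
    n = len(dst_hourly)
    for i in range(1, n - 1):
        value = dst_hourly[i]
        is_local_min = value < dst_hourly[i - 1] and value <= dst_hourly[i + 1]
        if value < threshold and is_local_min:
            start = max(0, i - half_width)
            end = min(n - 1, i + half_width)
            windows.append((start, i, end))
    return windows


if __name__ == "__main__":
    quiet = [-10.0] * 100
    storm = [-30.0, -80.0, -120.0, -90.0, -40.0]
    print(find_storm_windows(quiet + storm + quiet))
```

Each window spans 60 hr on either side of the peak, matching the 120-hr interval described above.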

We adopt a standard machine learning procedure to split the data set for training, validation, and testing. In particular, the validation set is used for hyperparameter tuning and early stopping, and the test set is used to evaluate the model performance on new data unseen during training. The data set is split as follows.
  • The test set consists of all 22 storms in 2015 (also studied by Al Shidi et al. (2022)), the 2011-08-05 storm (studied by Upendran et al. (2022)), and the 2024-05-10 Gannon extreme storm.

  • The validation set consists of a randomly sampled 10% of storms with minimum Dst less than $-100$ nT.

  • The training set consists of all remaining storms.

We introduce the GeoDGP model details in the next section.

3 Methodology

The GeoDGP model is developed based on the DGP. We begin by introducing the building blocks of DGP: GP. Then, we generalize GP to DGP and describe the model architecture adopted in this work. Finally, we present and discuss the choice of metrics for model evaluation.

3.1 Gaussian Process Regression

GP regression (Rasmussen & Williams, 2005), also known as kriging (Krige, 1951), provides a flexible and elegant framework for spatial data analysis. Under this framework, an observation of a scalar output ${y}_{i}\in \mathbb{R}$ is associated with its input ${\mathbf{x}}_{i}\in {\mathbb{R}}^{p}$ through
${y}_{i}=f\left({\mathbf{x}}_{i}\right)+{\epsilon }_{i},$ (3)
where $f$ is an unknown latent function and ${\epsilon }_{i}\sim \mathcal{N}\left(0,{\sigma }^{2}\right)$ is independent zero-mean Gaussian noise with variance ${\sigma }^{2}$. Following a Bayesian approach, a GP prior (Neal, 1996) is placed on $f$, where any finite collection of function values $\left\{f\left({\mathbf{x}}_{1}\right),f\left({\mathbf{x}}_{2}\right),\ldots ,f\left({\mathbf{x}}_{n}\right)\right\}$ has a multivariate Gaussian distribution. A GP is fully specified by a mean function $\mu (\mathbf{x})=\mathbb{E}[f(\mathbf{x})]$ and a covariance function $k\left(\mathbf{x},{\mathbf{x}}^{\prime }\right)=\mathbb{E}\left[(f(\mathbf{x})-\mu (\mathbf{x}))\left(f\left({\mathbf{x}}^{\prime }\right)-\mu \left({\mathbf{x}}^{\prime }\right)\right)\right]$, and can be written as
$f(\cdot )\sim \mathcal{G}\mathcal{P}(\mu (\cdot ),k(\cdot ,\cdot )).$ (4)
Given the training data $\mathcal{D}={\left\{\left({\mathbf{x}}_{i},{y}_{i}\right)\right\}}_{i=1}^{n}$, the GP prior updates to a GP posterior, and the prediction at input ${\mathbf{x}}_{\ast }$ has a Gaussian posterior-predictive probability density
$p\left({y}_{\ast }\vert {\mathbf{x}}_{\ast },\mathcal{D}\right)=\mathcal{N}\left({y}_{\ast };{\mu }_{\ast },{\sigma }_{\ast }^{2}\right),$ (5)
where
\begin{align*}{\mu }_{\ast }& =\mu \left({\mathbf{x}}_{\ast }\right)+{\mathbf{k}}_{\mathbf{X}}^{\top }\left({\mathbf{x}}_{\ast }\right){\left({\mathbf{K}}_{\mathbf{X}\mathbf{X}}+{\sigma }^{2}{\mathbf{I}}_{n}\right)}^{-1}(\mathbf{y}-\mu (\mathbf{X})),\\ {\sigma }_{\ast }^{2}& ={\sigma }^{2}+k\left({\mathbf{x}}_{\ast },{\mathbf{x}}_{\ast }\right)-{\mathbf{k}}_{\mathbf{X}}^{\top }\left({\mathbf{x}}_{\ast }\right){\left({\mathbf{K}}_{\mathbf{X}\mathbf{X}}+{\sigma }^{2}{\mathbf{I}}_{n}\right)}^{-1}{\mathbf{k}}_{\mathbf{X}}\left({\mathbf{x}}_{\ast }\right).\end{align*} (6)

Here ${\mathbf{k}}_{\mathbf{X}}\left({\mathbf{x}}_{\ast }\right)={\left[k\left({\mathbf{x}}_{1},{\mathbf{x}}_{\ast }\right),\ldots ,k\left({\mathbf{x}}_{n},{\mathbf{x}}_{\ast }\right)\right]}^{\top }$ is the $n\times 1$ vector of covariances between $f\left({\mathbf{x}}_{\ast }\right)$ and the training latent function values, ${\mathbf{K}}_{\mathbf{X}\mathbf{X}}$ is an $n\times n$ covariance matrix with ${\left[{\mathbf{K}}_{\mathbf{X}\mathbf{X}}\right]}_{ij}=k\left({\mathbf{x}}_{i},{\mathbf{x}}_{j}\right)$, ${\mathbf{I}}_{n}$ is the $n\times n$ identity matrix, $\mathbf{y}={\left[{y}_{1},\ldots ,{y}_{n}\right]}^{\top }$ is the vector of training outputs, and $\mu (\mathbf{X})={\left[\mu \left({\mathbf{x}}_{1}\right),\ldots ,\mu \left({\mathbf{x}}_{n}\right)\right]}^{\top }$ is the vector of mean function values at training points.
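Equation 6 translates directly into a few lines of linear algebra. The sketch below is an illustration under stated assumptions rather than the GeoDGP implementation: it uses a zero mean function, a squared-exponential kernel with fixed hyperparameters, and NumPy with a Cholesky factorization in place of an explicit matrix inverse. The function names are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel (an illustrative stationary choice)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)


def gp_predict(x_train, y_train, x_star, noise_var=1e-2):
    """Posterior-predictive mean and variance of Equation 6 with mu(x) = 0."""
    k_xx = rbf_kernel(x_train, x_train)
    k_xs = rbf_kernel(x_train, x_star)  # column j is k_X(x_star_j)
    k_ss_diag = np.diag(rbf_kernel(x_star, x_star))
    # Factor K_XX + sigma^2 I once; this is the O(n^3) step.
    chol = np.linalg.cholesky(k_xx + noise_var * np.eye(len(x_train)))
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, k_xs)
    mean = k_xs.T @ alpha
    var = noise_var + k_ss_diag - np.sum(v**2, axis=0)
    return mean, var


if __name__ == "__main__":
    x = np.array([[0.0], [1.0], [2.0]])
    y = np.sin(x[:, 0])
    mean, var = gp_predict(x, y, np.array([[1.0], [50.0]]), noise_var=1e-6)
    print(mean, var)
```

Near a training point the predictive variance shrinks toward the noise level, whereas far from the data the prediction reverts to the prior mean with variance close to the prior variance plus noise.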

Applying GP regression to predict ground magnetic field perturbations faces several challenges. First, its computational cost scales as $\mathcal{O}\left({n}^{3}\right)$ due to the need to invert the covariance matrix, making it impractical for the SuperMAG data set, which contains hundreds of millions of data entries. Second, the expressive power of a GP model is heavily dependent on the choice of kernel functions. Standard stationary kernels are often inadequate since stationarity and/or smoothness assumptions are violated by storm-time disturbances. Iong et al. (2024) mitigated this effect by introducing contaminated Gaussian noise (Gleason, 1993) to the GP with standard stationary kernels when predicting $\mathrm{d}{B}_{H}$. While this technique improved the coverage of the prediction interval, it still underestimated the mean $\mathrm{d}{B}_{H}$. Alternatively, specifying a non-stationary kernel offers more flexibility but has its own challenges. The complexity of data often requires a richly parameterized kernel, which may be difficult to design, prone to overfitting, and expensive to optimize (Calandra et al., 2016; Duvenaud et al., 2013; Noack et al., 2023; Snoek et al., 2012; Wilson et al., 2016).

3.2 Deep Gaussian Process

DGP (Damianou & Lawrence, 2013), a modern extension of the classical GP, provides a solution to the above limitations but sacrifices exact inference on the unknown function $f$. A DGP is a Bayesian model that learns a data representation hierarchy by cascading multiple sparse Gaussian processes (Matthews et al., 2016; Snelson & Ghahramani, 2005; Titsias, 2009). Specifically, each GP layer ${\mathbf{f}}^{\ell }$ in the DGP is a set of vector-valued functions independently drawn from a GP prior, and the output of the previous layer becomes the input for the next layer:
\begin{align*}f(\mathbf{x})={\mathbf{f}}^{L}\left({\mathbf{f}}^{L-1}\left(\cdots {\mathbf{f}}^{2}\left({\mathbf{f}}^{1}(\mathbf{x})\right)\cdots \right)\right),\end{align*} (7)
with
\begin{align*}{\mathbf{f}}_{d}^{\ell }(\cdot )\sim \mathcal{G}\mathcal{P}\left({\mu }^{\ell }(\cdot ),{k}^{\ell }(\cdot ,\cdot )\right),\quad d=1,2,\ldots ,{n}_{\ell },\end{align*} (8)
where $L$ is the total number of layers, and ${n}_{\ell }$ is the dimension of the vector-valued ${\mathbf{f}}^{\ell }$.

Training is performed by maximizing the marginal likelihood of the model, and the computational challenge is addressed using approximate inference techniques, such as variational inference (VI) (Fox & Roberts, 2012; Matthews et al., 2016; Titsias, 2009), stochastic VI (SVI) (Hoffman et al., 2013) and minibatch subsampling (Hensman et al., 2013). The main idea is to obtain a low-rank approximation to the covariance matrix of each GP layer using a set of inducing points, whose locations can be optimized based on training data, and whose number $m$ is much smaller than the number of training data points $n$. This reduces the computational cost to $\mathcal{O}\left(n{m}^{2}\left({n}_{1}+{n}_{2}+\cdots +{n}_{L}\right)\right)$, making DGP scalable to data sets with millions of observations. In this work, we adopt the doubly stochastic variational approach proposed by Salimbeni and Deisenroth (2017), which improved upon the original DGP (Damianou & Lawrence, 2013) by maintaining the correlation between layers while placing no restrictions on the noise corruptions in each layer. For prediction, the output ${y}_{\ast }$ is modeled as a mixture of Gaussians, with each component drawn from the variational posterior of latent functions evaluated at the test location ${\mathbf{x}}_{\ast }$. We set the number of mixture components to 10.
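The mixture-based prediction can be sketched as follows, assuming equally weighted Gaussian components whose means and standard deviations have already been drawn from the variational posterior (the sampling itself is not shown). The function names and the bisection-based quantile search are illustrative choices, not the authors' implementation.

```python
import math

def normal_cdf(y, mean, std):
    """CDF of a univariate Gaussian."""
    return 0.5 * (1.0 + math.erf((y - mean) / (std * math.sqrt(2.0))))

def mixture_mean(means):
    """Mean of an equally weighted mixture (assumed weights 1/S)."""
    return sum(means) / len(means)

def mixture_cdf(y, means, stds):
    """CDF of the equally weighted Gaussian mixture."""
    return sum(normal_cdf(y, m, s) for m, s in zip(means, stds)) / len(means)

def mixture_quantile(p, means, stds, tol=1e-8):
    """Quantile at level p, found by bisection on the monotone mixture CDF."""
    lo = min(m - 10.0 * s for m, s in zip(means, stds))
    hi = max(m + 10.0 * s for m, s in zip(means, stds))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, means, stds) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With 10 components, `mixture_mean` would give a point prediction, and `mixture_quantile` evaluated at two levels such as 0.025 and 0.975 would give the endpoints of a central prediction interval.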

DGP circumvents the challenge of handcrafting a sophisticated nonstationary kernel (Noack et al., 2023), as one can stack GP layers with stationary kernels to introduce a hierarchical composition of nonlinear transformations to the input space, thereby achieving nonstationarity. We show our choice of model architecture and setup next.

3.3 Model Architecture and Setup

The DGP model architecture used in this work is chosen based on empirical experiments. Specifically, the model consists of one input layer, three hidden layers, and one output layer (i.e., $L=4$). The dimension of the input vector (introduced in Section 2.3) is 96, while the dimensions of each GP layer are set to be ${n}_{1}=20$, ${n}_{2}=10$, ${n}_{3}=10$, and ${n}_{4}=1$, respectively. The numbers of inducing points we use for each GP layer are 256, 128, 128, 128, respectively. Each GP layer is equipped with a stationary kernel and a mean function.

Since ground magnetic perturbations can be highly variable during storm periods, we employ two different types of stationary kernels. For the first three hidden layers, we use the Matérn class kernel:
\begin{align*}k\left(\mathbf{x},{\mathbf{x}}^{\prime };\nu \right)=\frac{{2}^{1-\nu }}{\Gamma (\nu )}{\left(\frac{\sqrt{2\nu }\left\vert \mathbf{x}-{\mathbf{x}}^{\prime }\right\vert }{\ell }\right)}^{\nu }{K}_{\nu }\left(\frac{\sqrt{2\nu }\left\vert \mathbf{x}-{\mathbf{x}}^{\prime }\right\vert }{\ell }\right),\end{align*} (9)
where $\nu $ is a user-specified positive parameter that controls the smoothness of the function, $\ell $ is the length-scale parameter estimated from training data, $\Gamma $ is the Gamma function, and ${K}_{\nu }$ is the modified Bessel function of the second kind. We use $\nu =0.5$ in the first hidden layer and $\nu =1.5$ in the second and the third hidden layers. The Matérn kernel with $\nu =0.5$ allows the latent representation of inputs to be highly non-smooth, and the Matérn kernel with $\nu =1.5$ models functions that are only once differentiable. For the output layer, we use a spectral kernel defined based on its spectral density. Specifically, we use the spectral delta kernel (Lázaro-Gredilla et al., 2010) with the form
\begin{align*}k\left(\mathbf{x},{\mathbf{x}}^{\prime };R\right)=\frac{{\sigma }_{0}^{2}}{R}\sum\limits _{r=1}^{R}\cos \left(2\pi {\mathbf{s}}_{r}^{\top }\left(\mathbf{x}-{\mathbf{x}}^{\prime }\right)\right),\end{align*} (10)
where $R$ is the number of spectral points specified by the user, ${\sigma }_{0}^{2}$ is a positive parameter, and ${\mathbf{s}}_{r}$ is a $p$-dimensional vector of spectral frequencies; both ${\sigma }_{0}^{2}$ and ${\mathbf{s}}_{r}$ are estimated from the training data. This layer can be viewed as a specific form of random feature expansion (Cutajar et al., 2017; Rahimi & Recht, 2007), which allows the model to learn spectral features in observations. It can be shown that the spectral delta kernel can approximate any stationary kernel by adjusting its hyperparameters, and the accuracy relies on the number of spectral points $R$ (Lázaro-Gredilla et al., 2010). We choose $R=1000$ in this study.
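For the two smoothness values used here, Equation 9 reduces to well-known closed forms, which the sketch below evaluates alongside the spectral delta kernel of Equation 10. The function names are hypothetical, and the kernels take either a precomputed distance `r` (Matérn) or plain Python sequences (spectral delta); this is a minimal illustration rather than the training code.

```python
import math

def matern12(r, lengthscale):
    """Matern kernel with nu = 0.5: exp(-r / l), for distance r >= 0."""
    return math.exp(-r / lengthscale)

def matern32(r, lengthscale):
    """Matern kernel with nu = 1.5: (1 + sqrt(3) r / l) exp(-sqrt(3) r / l)."""
    a = math.sqrt(3.0) * r / lengthscale
    return (1.0 + a) * math.exp(-a)

def spectral_delta(x, x_prime, freqs, sigma0_sq):
    """Equation 10: average of cosines over the R spectral points in freqs.

    x, x_prime: sequences of length p; freqs: list of R length-p sequences.
    """
    total = 0.0
    for s in freqs:
        phase = 2.0 * math.pi * sum(
            si * (a - b) for si, a, b in zip(s, x, x_prime)
        )
        total += math.cos(phase)
    return sigma0_sq * total / len(freqs)
```

All three kernels depend on the inputs only through $\mathbf{x}-{\mathbf{x}}^{\prime }$, which is what makes them stationary; at zero separation the Matérn kernels equal 1 and the spectral delta kernel equals ${\sigma }_{0}^{2}$.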

We use a linear mean function for all hidden layers. Duvenaud et al. (2014) show that a DGP with zero mean functions for the inner layers suffers a degeneration in representational capacity as the number of layers increases. Using a linear mean function avoids this pathology (Salimbeni & Deisenroth, 2017). These mean function parameters are also estimated from the training set.

3.4 Model Evaluation Metrics

The choice of evaluation metrics depends on the objective of the prediction task. In this study, we consider objectives for both continuous and categorical forecasts. We use $y$ to denote the true value and $\widehat{y}$ to denote the predicted value. For continuous forecasts, we calculate the mean absolute error (MAE), which is defined as:
$\text{MAE}=\frac{1}{Q}\sum\limits _{i=1}^{Q}\vert {y}_{i}-{\widehat{y}}_{i}\vert ,$ (11)
where $Q$ is the number of data points. For the GeoDGP case, ${\widehat{y}}_{i}$ is taken to be the mean of the Gaussian mixture posterior-predictive distribution.
Categorical forecasts are also widely used in the community (e.g., Pulkkinen et al., 2013), where real-valued predictions and observations are dichotomized into binary outcomes using a pre-specified threshold. A label of 1 is assigned if the value exceeds the threshold, and 0 otherwise. A contingency table in Table 2 can be formed to summarize the number of hits, false alarms, misses, and no events. Based on these quantities, the HSS (Heidke, 1926) is a popular accuracy measure (Al Shidi et al., 2022; Pulkkinen et al., 2013; Welling et al., 2018) defined as:
$\text{HSS}=\frac{2(HN-MF)}{(H+M)(M+N)+(H+F)(F+N)}.$ (12)
Table 2. Contingency Table for Defining the Heidke Skill Score

| Predicted $\left(\hat{y}\right)$ | Observed $(y)$: 1 (above threshold) | Observed $(y)$: 0 (below threshold) |
| --- | --- | --- |
| 1 (above threshold) | H (Hit: true positive) | F (False alarm: false positive) |
| 0 (below threshold) | M (Miss: false negative) | N (No event: true negative) |

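The dichotomization and Equation 12 can be sketched as below. The function names are hypothetical, and returning NaN when the denominator vanishes (for example, when no event is observed or predicted) is an assumption of this sketch, since the text does not specify how that case is handled.

```python
def contingency_counts(observed, predicted, threshold):
    """Count hits (H), false alarms (F), misses (M), and no events (N).

    A value is labeled 1 if it exceeds the threshold and 0 otherwise.
    """
    H = F = M = N = 0
    for y, y_hat in zip(observed, predicted):
        obs_event = y > threshold
        pred_event = y_hat > threshold
        if obs_event and pred_event:
            H += 1
        elif pred_event:
            F += 1
        elif obs_event:
            M += 1
        else:
            N += 1
    return H, F, M, N

def heidke_skill_score(H, F, M, N):
    """Heidke Skill Score from contingency-table counts (Equation 12)."""
    denom = (H + M) * (M + N) + (H + F) * (F + N)
    if denom == 0:
        return float("nan")  # assumption: score left undefined in this case
    return 2.0 * (H * N - M * F) / denom
```

For example, `heidke_skill_score(*contingency_counts(obs, pred, 50.0))` would score a forecast `pred` against observations `obs` at the 50 nT threshold.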
The HSS has a range of $-\infty $ to 1, with 1 being perfect predictions. A positive value of HSS means the model predictions are better than a random forecast (random proportional to the historical frequency of occurrence), and a negative value means they are worse than random. The choice of thresholds depends on the specific task. For example, a threshold of 200 nT corresponds to the magnitude of certain geomagnetic activities, such as the auroral electrojet currents (Akasofu et al., 1980; Klumpar, 1979; Waters et al., 2001). In the GEM challenge, thresholds of 0.3, 0.7, 1.1, and 1.5 nT/s were used to evaluate ${(\mathrm{d}\mathbf{B}/\mathrm{d}t)}_{H}$ (Pulkkinen et al., 2013). Using the empirical power law correlation presented in Tóth et al. (2014):
${\left(\frac{\mathrm{d}\mathbf{B}}{\mathrm{d}t}\right)}_{H}\approx {\left(\frac{\mathrm{d}{B}_{H}}{292}\right)}^{1.14},$ (13)
these thresholds approximately correspond to 100, 200, 300, and 400 nT for $\mathrm{d}{B}_{H}$ evaluation, respectively. Here, the unit of ${(\mathrm{d}\mathbf{B}/\mathrm{d}t)}_{H}$ is nT/s, and the unit of $\mathrm{d}{B}_{H}$ is nT. We use these thresholds and additionally include the 50 nT threshold used in Al Shidi et al. (2022) to evaluate the HSS. Note that both $\mathrm{d}{B}_{E}$ and $\mathrm{d}{B}_{N}$ can change sign and take on positive or negative values during a geomagnetic storm. As a result, HSS is no longer meaningful for them. Instead, we focus on continuous forecasts for these components and use the MAE as the main evaluation metric. Additionally, we calculate the accuracy of sign predictions:
$\text{SA}=\text{Sign Accuracy}=\frac{1}{Q}\sum\limits _{i=1}^{Q}\max \left(0,\text{sign}\left({\widehat{y}}_{i}\right)\,\text{sign}\left({y}_{i}\right)\right),$ (14)
where $\text{sign}(y)=\pm 1$ is the sign of $y$. Perfect sign accuracy (SA) results in $\text{SA}=1$, while having the wrong sign all the time gives $\text{SA}=0$.
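The continuous metrics of Equations 11 and 14, together with the inversion of the power law in Equation 13, can be sketched as follows. The function names are hypothetical, and treating a value of exactly zero as positive in `sign` is an assumption of this sketch, since the text only states that the sign is $\pm 1$.

```python
def mean_absolute_error(y_true, y_pred):
    """Equation 11: average absolute difference between truth and prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def sign(value):
    """Return +1 or -1; zero is treated as positive (assumption)."""
    return 1 if value >= 0 else -1

def sign_accuracy(y_true, y_pred):
    """Equation 14: fraction of predictions whose sign matches the observation."""
    matches = sum(max(0, sign(p) * sign(t)) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

def dbh_from_dbdt(dbdt_nt_per_s):
    """Invert Equation 13: dB_H [nT] = 292 * (dB/dt)_H ** (1 / 1.14)."""
    return 292.0 * dbdt_nt_per_s ** (1.0 / 1.14)

# GEM challenge thresholds in nT/s mapped to dB_H thresholds in nT
gem_thresholds = [0.3, 0.7, 1.1, 1.5]
dbh_thresholds = [dbh_from_dbdt(v) for v in gem_thresholds]
```

Evaluating `dbh_thresholds` gives roughly 102, 214, 317, and 417 nT, consistent with the rounded values of 100, 200, 300, and 400 nT quoted above.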
Lastly, we evaluate the probabilistic aspect of the model. From a Bayesian perspective, the posterior predictive distribution of GeoDGP represents our updated belief about the distribution of observations conditioned on the observed data, obtained by combining prior knowledge with observed data through Bayes's rule. However, this distribution may not perfectly reflect the true underlying data generating process. Therefore, we perform the evaluation from a practical perspective. The central $(1-\alpha )\cdot 100\%$ prediction interval consists of lower and upper endpoints ${q}_{L}$ and ${q}_{U}$ that correspond to the predictive quantiles at levels $\frac{\alpha }{2}$ and $1-\frac{\alpha }{2}$. The evaluation should reward forecasters for narrow prediction intervals, while imposing a penalty whose size depends on $\alpha $ when the interval fails to cover the observation. We consider the interval score proposed in Gneiting and Raftery (2007):
${S}_{\alpha }\left({q}_{L},{q}_{U};y\right)=\left({q}_{U}-{q}_{L}\right)+\frac{2}{\alpha }\left(\max \left({q}_{L}-y,0\right)+\max \left(y-{q}_{U},0\right)\right),$ (15)
where $y$ is the observation (treated as a random variable). Intuitively, it can be interpreted as the sum of the exact interval length and non-negative penalties, whose sizes correspond to the weighted distances between the observation and the nearest interval endpoints. A smaller ${S}_{\alpha }$ therefore indicates a better forecast. ${S}_{\alpha }$ is also a proper scoring rule, which means it is minimized if and only if the quoted endpoints are equal to the true $\frac{\alpha }{2}$th and $\left(1-\frac{\alpha }{2}\right)$th quantiles of the observation distribution (Gneiting & Raftery, 2007). Given $Q$ observation data points, we use the sample averaged interval score to evaluate the prediction interval:
$S=\frac{1}{Q}\sum\limits _{i=1}^{Q}{S}_{\alpha }\left({{q}_{L}}_{i},{{q}_{U}}_{i};{y}_{i}\right).$ (16)

We also calculate the sample-averaged interval width $W=(1/Q){\sum }_{i=1}^{Q}\left({{q}_{U}}_{i}-{{q}_{L}}_{i}\right)$ and the empirical coverage rate $C$, defined as the fraction of observations ${y}_{i}$ satisfying ${{q}_{L}}_{i}\le {y}_{i}\le {{q}_{U}}_{i}$, whose expected value is $1-\alpha $ when the prediction intervals are well calibrated.
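A minimal sketch of Equations 15 and 16, together with the interval width $W$ and coverage rate $C$, is given below. The function names are hypothetical, and the default `alpha=0.05` is chosen only to match a nominal 95% interval.

```python
def interval_score(q_low, q_up, y, alpha=0.05):
    """Equation 15: interval width plus a penalty, scaled by 2/alpha, for misses."""
    penalty = max(q_low - y, 0.0) + max(y - q_up, 0.0)
    return (q_up - q_low) + (2.0 / alpha) * penalty

def summarize_intervals(q_lows, q_ups, ys, alpha=0.05):
    """Return the sample-averaged score S, width W, and coverage rate C."""
    n = len(ys)
    S = sum(
        interval_score(lo, up, y, alpha) for lo, up, y in zip(q_lows, q_ups, ys)
    ) / n
    W = sum(up - lo for lo, up in zip(q_lows, q_ups)) / n
    C = sum(1 for lo, up, y in zip(q_lows, q_ups, ys) if lo <= y <= up) / n
    return S, W, C
```

When every observation is covered, the penalty vanishes and $S=W$; each miss adds $2/\alpha =40$ times the distance to the nearest endpoint at $\alpha =0.05$, which is why $S$ can greatly exceed $W$ where coverage is low.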

4 Results

4.1 Model Comparisons

We divide the test set from Section 2.5 into three subsets for more detailed model comparisons.
  1. Test set 1 consists of 22 geomagnetic storms in 2015.

  2. Test set 2 consists of the 2011-08-05 and 2015-03-15 storms.

  3. Test set 3 consists of the 2024-05-10 Gannon extreme storm.

We use test set 1 to statistically evaluate the performance of GeoDGP and compare it against the Geospace model. Test set 2, although smaller, is specifically used for comparing GeoDGP with the DAGGER model, as evaluation of DAGGER is only available on these two storms (Upendran et al., 2022). Finally, test set 3 compares GeoDGP and Geospace during an extreme geomagnetic storm, providing insights into their robustness under severe conditions.

The Geospace model consists of three components of the SWMF (Gombosi et al., 2021; Tóth et al., 2005): the Global Magnetosphere domain represented by the Block-Adaptive Tree Solar wind Roe-type Upwind Scheme (BATSRUS) MHD model (Powell et al., 1993; Tóth et al., 2012), the Inner Magnetosphere simulated by the Rice Convection Model (RCM, Toffoletto et al., 2003), and the Ionosphere Electrodynamics solved by the Ridley Ionosphere Model (RIM, Ridley et al., 2004). The Geospace model is driven by the ${F}_{10.7}$ index and by the solar wind and IMF observations ballistically propagated from L1 to the upstream boundary at 32 ${R}_{E}$ from the center of Earth. The Geospace model generates, among other products, the magnetic perturbation forecast maps ${T}_{S}$ time ahead. Its performance during storm time was studied in Pulkkinen et al. (2013), Tóth et al. (2014), and Al Shidi et al. (2022).

The DAGGER model, in contrast, is a data-driven model that combines deep learning with a spherical harmonic basis to predict ground magnetic perturbations ${T}_{S}+30$ min ahead. Specifically, it consists of a Gated Recurrent Unit (Cho et al., 2014) module that summarizes solar wind features, an ANN module that maps these features to coefficients of the spherical harmonic basis, and a spherical harmonic constructor. Upendran et al. (2022) provide a detailed description of the DAGGER model.

GeoDGP has two different forecast horizons: one is ${T}_{S}$, which is the same as the Geospace model and thus allows for a fair comparison, while the other is ${T}_{S}+1\,\text{hr}$, for which no publicly available forecasting model currently exists. Here, we additionally introduce model persistence (MP) and observation persistence (OP) as baselines, using these abbreviations in the tables hereafter. A persistence model with lead time $T$ assumes ${y}_{t+T}={y}_{t}$ for all $t$. We compare GeoDGP with ${T}_{S}+1\,\text{hr}$ lead time to its 1 hr MP, which is obtained by using the GeoDGP predictions for ${T}_{S}$ lead time to predict the observations at ${T}_{S}+1$ hr. This comparison assesses whether the model with the larger lead time provides more informative predictions for the future than a simple shift of earlier predictions. Observation persistence, on the other hand, is a common benchmark for time series prediction; however, it is often difficult to beat on short time scales. We note that real-time magnetometer observations are only available at a few locations, while the models provide forecasts over the entire surface of Earth. For OP, since the lead time ${T}_{S}$ depends on the solar wind speed and typically ranges from 30 min to 1 hr, we use an average of 45 min as an approximation. Subsequently, we compare GeoDGP predictions with the ${T}_{S}$ and ${T}_{S}+1$ hr lead times to the OP of 45 min and 1 hr 45 min, respectively. The comparison assesses whether the model provides more informative predictions for the future than the current observation.
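Both persistence baselines amount to shifting a time series forward by the lead time. The sketch below assumes a regularly sampled series and expresses the lead time in samples; the function name and the use of `None` for times without a forecast are illustrative choices.

```python
def persistence_forecast(series, lead_steps):
    """Persistence baseline: the forecast for time t + lead is the value at t.

    Returns a list aligned with `series`; the first `lead_steps` entries are
    None because no earlier value exists to persist.
    """
    if lead_steps < 0:
        raise ValueError("lead_steps must be non-negative")
    n = len(series)
    k = min(lead_steps, n)
    return [None] * k + list(series[: n - k])
```

Under an assumed 1-min cadence, OP for the ${T}_{S}$ comparison would be `persistence_forecast(observations, 45)` and for the ${T}_{S}+1$ hr comparison `persistence_forecast(observations, 105)`, while the 1 hr MP would apply a 60-sample shift to the GeoDGP ${T}_{S}$ predictions instead of the observations.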

Following Al Shidi et al. (2022), the evaluation time window for each storm starts 6 hr prior to the storm onset and spans a duration of 54 hr, provided that data is available. The storm onset times are selected to be the time when the SYM-H index starts to decrease, and these times are listed in Table 3. We exclude certain storm periods when data is unavailable. This includes periods when:
  1. OMNI data (i.e., the model input) is missing and the gap is larger than 15 min, as described in Section 2.3;

  2. Station measurements are unavailable; and

  3. Predictions from the model being compared (i.e., Geospace and/or DAGGER) are unavailable.
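The evaluation window described above can be computed as in the following sketch. The function name is hypothetical, and the exclusion criteria for missing data are not implemented here.

```python
from datetime import datetime, timedelta

def evaluation_window(onset_utc):
    """Return (start, end): start is 6 hr before onset, and the span is 54 hr."""
    start = onset_utc - timedelta(hours=6)
    end = start + timedelta(hours=54)
    return start, end

# Example with the 2015-03-17 04:07 UTC onset listed in Table 3
start, end = evaluation_window(datetime(2015, 3, 17, 4, 7))
```

For this onset, the window runs from 2015-03-16 22:07 to 2015-03-19 04:07 UTC, that is, from 6 hr before to 48 hr after the onset.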

Evaluations involving multiple storms are conducted by first concatenating the storm data and then calculating the evaluation metrics. The performance of the models is compared on a regional basis, as the availability of ground magnetometer stations varies over time. We divide magnetometer stations into low- and mid-latitude, high-latitude, and auroral-latitude regions based on their magnetic latitude $\lambda $. A station is assigned to the low- and mid-latitude region if $\vert \lambda \vert \le 50^{\circ}$, and to the high-latitude region otherwise. The auroral-latitude region is defined as $60^{\circ}\le \vert \lambda \vert \le 70^{\circ}$. For low- and mid-latitude stations, we calculate the HSS of $\mathrm{d}{B}_{H}$ predictions using thresholds of 50, 100, and 200 nT. For high-latitude stations, we use thresholds of 50, 200, 300, and 400 nT due to higher disturbance levels. Table 4 shows the regional medians of percentiles in magnetometer station observations corresponding to the selected HSS thresholds during storm periods. The calculation includes 22 storms in 2015 (i.e., test set 1). We see that the largest threshold of 400 nT corresponds to at least the 95th percentile of observations, thus evaluating whether the model can predict the “tail” of the distribution during storm periods. In contrast, the smallest threshold of 50 nT evaluates the model performance around the median of the observations in the high-latitude regions, and around the 70th percentile when all latitudes are considered.
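The regional assignment can be expressed compactly as below. The function name and region labels are hypothetical; following the text, a station at exactly $50^{\circ}$ is placed in the low- and mid-latitude region, and the auroral region is treated as a subset of the high-latitude region.

```python
def station_regions(mag_lat_deg):
    """Return the region labels for a station at the given magnetic latitude."""
    lat = abs(mag_lat_deg)
    regions = ["low_mid" if lat <= 50.0 else "high"]
    if 60.0 <= lat <= 70.0:
        regions.append("auroral")  # auroral stations are also high-latitude
    return regions
```

Because auroral stations carry both labels, the auroral-latitude station count is included in the high-latitude count rather than added to it.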

Table 3. List of Storm Onset Times (in UTC) Included in the Study
2011-08-05 18:02 2015-02-16 19:24 2015-03-17 04:07 2015-04-09 21:52
2015-04-14 12:55 2015-05-12 18:05 2015-05-18 10:12 2015-06-07 10:30
2015-06-22 05:00 2015-07-04 13:06 2015-07-10 22:21 2015-07-23 01:51
2015-08-15 08:04 2015-08-26 05:45 2015-09-07 13:13 2015-09-08 21:45
2015-09-20 05:46 2015-10-04 00:30 2015-10-07 01:41 2015-11-03 05:31
2015-11-06 18:09 2015-11-30 06:09 2015-12-19 16:13 2024-05-10 18:00
Table 4. Regional Median of Percentiles in Magnetometer Station Observations Corresponding to the Heidke Skill Score Thresholds During the 22 Geomagnetic Storms in 2015

| HSS thresholds (nT) | 50 | 100 | 200 | 300 | 400 |
| --- | --- | --- | --- | --- | --- |
| Percentiles of low and mid latitudes $(\vert \lambda \vert \le 50^{\circ})$ | 76th | 95th | 100th | 100th | 100th |
| Percentiles of high latitudes $(50^{\circ}\le \vert \lambda \vert \le 90^{\circ})$ | 49th | 68th | 91st | 97th | 98th |
| Percentiles of auroral latitudes $(60^{\circ}\le \vert \lambda \vert \le 70^{\circ})$ | 47th | 65th | 82nd | 90th | 95th |
| Percentiles of all latitudes $(\vert \lambda \vert \le 90^{\circ})$ | 71st | 91st | 98th | 99th | 100th |

4.2 Test Set 1: Statistical Comparison Between GeoDGP and Geospace Models

Test set 1 includes a total of 22 geomagnetic storms in 2015, which covers a wide range of events. Specifically, the median minimum Dst is $-74$ nT, while the lowest minimum Dst is $-234$ nT. On this test set, we evaluate and compare our GeoDGP predictions with the Geospace simulations performed by Al Shidi et al. (2022). The total number of stations being studied is 222. The regional categorization assigns 81 stations to the low- and mid-latitude region, 141 stations to the high-latitude region, and 53 stations to the auroral latitude region. We report the evaluation metrics of the two models in Table 5. The metrics of stations within each region are summarized by their median values. The model with the best performance is bolded.

Table 5. Test Set 1: 22 Geomagnetic Storms in 2015

| $\Delta $B | Region | Metric | Geospace | GeoDGP | (NoDst, OP) | GeoDGP | (MP, OP) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | | | ${T}_{S}$ | ${T}_{S}$ | (${T}_{S}$, 45 min) | ${T}_{S}$ + 1 hr | (${T}_{S}$ + 1 hr, 105 min) |
| $\mathrm{d}{B}_{H}$ | Low and mid latitudes; $\vert \lambda \vert \le 50^{\circ}$; 81 Stations | HSS (50 nT) | 0.57 | 0.74 | (0.51, 0.75) | 0.71 | (0.69, 0.61) |
| | | HSS (100 nT) | 0.55 | 0.75 | (0.36, 0.75) | 0.72 | (0.69, 0.65) |
| | | HSS (200 nT) | 0.00 | 0.32 | (0.00, 0.39) | 0.06 | (0.14, 0.30) |
| | | MAE | 15 | 10 | (15, 9) | 11 | (12, 13) |
| $\mathrm{d}{B}_{N}$ | | MAE | 16 | 11 | (-, 9) | 12 | (13, 14) |
| | | SA | 0.84 | 0.90 | (-, 0.92) | 0.90 | (0.89, 0.88) |
| $\mathrm{d}{B}_{E}$ | | MAE | 11 | 10 | (-, 6) | 10 | (10, 9) |
| | | SA | 0.58 | 0.63 | (-, 0.81) | 0.63 | (0.63, 0.73) |
| $\mathrm{d}{B}_{H}$ | High latitudes; $50^{\circ}\le \vert \lambda \vert \le 90^{\circ}$; 141 Stations | HSS (50 nT) | 0.44 | 0.57 | (0.53, 0.59) | 0.49 | (0.52, 0.44) |
| | | HSS (200 nT) | 0.31 | 0.56 | (0.50, 0.48) | 0.49 | (0.48, 0.31) |
| | | HSS (300 nT) | 0.26 | 0.48 | (0.46, 0.39) | 0.40 | (0.38, 0.22) |
| | | HSS (400 nT) | 0.20 | 0.42 | (0.38, 0.32) | 0.32 | (0.30, 0.14) |
| | | MAE | 74 | 46 | (46, 46) | 52 | (53, 57) |
| $\mathrm{d}{B}_{N}$ | | MAE | 86 | 53 | (-, 52) | 55 | (59, 68) |
| | | SA | 0.69 | 0.78 | (-, 0.79) | 0.76 | (0.76, 0.72) |
| $\mathrm{d}{B}_{E}$ | | MAE | 44 | 35 | (-, 35) | 37 | (38, 45) |
| | | SA | 0.59 | 0.70 | (-, 0.75) | 0.68 | (0.68, 0.69) |
| $\mathrm{d}{B}_{H}$ | Auroral latitudes; $60^{\circ}\le \vert \lambda \vert \le 70^{\circ}$; 53 Stations | HSS (50 nT) | 0.49 | 0.57 | (0.55, 0.61) | 0.49 | (0.53, 0.45) |
| | | HSS (200 nT) | 0.42 | 0.59 | (0.56, 0.50) | 0.54 | (0.53, 0.32) |
| | | HSS (300 nT) | 0.34 | 0.55 | (0.53, 0.44) | 0.50 | (0.46, 0.26) |
| | | HSS (400 nT) | 0.27 | 0.51 | (0.48, 0.38) | 0.47 | (0.41, 0.20) |
| | | MAE | 80 | 68 | (70, 69) | 75 | (79, 93) |
| $\mathrm{d}{B}_{N}$ | | MAE | 95 | 70 | (-, 74) | 74 | (81, 102) |
| | | SA | 0.69 | 0.78 | (-, 0.79) | 0.76 | (0.76, 0.71) |
| $\mathrm{d}{B}_{E}$ | | MAE | 45 | 37 | (-, 40) | 38 | (39, 49) |
| | | SA | 0.56 | 0.66 | (-, 0.72) | 0.63 | (0.63, 0.65) |
| $\mathrm{d}{B}_{H}$ | All latitudes; $\vert \lambda \vert \le 90^{\circ}$; 222 Stations | HSS (50 nT) | 0.49 | 0.63 | (0.53, 0.64) | 0.57 | (0.57, 0.49) |
| | | HSS (200 nT) | 0.24 | 0.53 | (0.43, 0.47) | 0.43 | (0.42, 0.31) |
| | | HSS (300 nT) | 0.22 | 0.44 | (0.41, 0.36) | 0.34 | (0.33, 0.19) |
| | | HSS (400 nT) | 0.19 | 0.41 | (0.36, 0.32) | 0.29 | (0.29, 0.14) |
| | | MAE | 27 | 23 | (26, 18) | 26 | (26, 24) |
| $\mathrm{d}{B}_{N}$ | | MAE | 30 | 21 | (-, 19) | 23 | (24, 26) |
| | | SA | 0.71 | 0.82 | (-, 0.83) | 0.80 | (0.80, 0.78) |
| $\mathrm{d}{B}_{E}$ | | MAE | 23 | 18 | (-, 17) | 19 | (19, 21) |
| | | SA | 0.58 | 0.67 | (-, 0.77) | 0.66 | (0.67, 0.70) |

  • Note. Median HSS, MAE [nT] and sign accuracy (SA) in various magnetic latitude $(\lambda )$ regions for Geospace, GeoDGP, observation persistence (OP), GeoDGP without Dst (only for predicting $\mathrm{d}{B}_{H}$) (NoDst), and GeoDGP model persistence (MP). The corresponding lead time for each model is shown below the model name. The model with the best performance is bolded.

We begin by discussing the performance of GeoDGP with a lead time of ${T}_{S}$. For the all-latitude global prediction of $\mathrm{d}{B}_{H}$, GeoDGP achieves a median HSS of 0.63 with the 50 nT threshold, outperforming the Geospace model's HSS of 0.49. For thresholds of 200 nT and above, GeoDGP achieves median HSS values that roughly double those from the Geospace model: GeoDGP's 0.53 versus Geospace's 0.24 at the 200 nT threshold, GeoDGP's 0.44 versus Geospace's 0.22 at the 300 nT threshold, and GeoDGP's 0.41 versus Geospace's 0.19 at the 400 nT threshold. Additionally, GeoDGP outperforms OP in HSS with thresholds of 200 nT and above, while achieving comparable HSS with the 50 nT threshold. In contrast, the Geospace model's HSS falls below that of OP. These results illustrate the strong performance of GeoDGP in predicting high-level disturbances, and suggest that the Geospace model has a tendency to underpredict, as also noted in Al Shidi et al. (2022). GeoDGP also has a smaller MAE of 23 nT compared to 27 nT for Geospace, although it is larger than the 18 nT of OP, which is expected given the short time scale.

For regional comparisons, GeoDGP consistently outperforms the Geospace model and shows higher HSS and lower MAE across all regions. As the threshold increases from 50 to 200 nT, the HSS for both models decreases more rapidly in the low- and mid-latitude region than in the high-latitude region. This is due to the relative rarity of high-level disturbances in the low- and mid-latitude region, where a threshold of 200 nT corresponds to the upper extreme of the distribution (i.e., the 100th percentile as shown in Table 4), making such events inherently more difficult to predict. Figure 2 illustrates this by showing GeoDGP's HSS at thresholds of 50 and 200 nT for each station. The HSS of high-latitude stations remains high under both thresholds, whereas the HSS of low- and mid-latitude stations exhibits noticeable drops as the threshold increases. This shows that for general cases (non-rare events), GeoDGP does not underpredict high-level disturbances. Figure 3 illustrates how HSS varies with magnetic latitude under four different thresholds. Across all latitudes, GeoDGP consistently achieves higher HSS compared to Geospace. For both models, increasing the threshold leads to narrower station distributions concentrating around higher magnetic latitudes, that is, closer to the auroral zone, as expected. Table 5 shows that the increasing level of disturbances with magnetic latitude also results in a corresponding increase in MAE, and GeoDGP achieves lower MAE values compared to Geospace. Moreover, GeoDGP achieves comparable or higher HSS than OP across all regions. Given that most ground magnetometer station data is not available in real time and that global station coverage is uneven, the stable performance of GeoDGP underscores its potential for real-time forecasting of local ground magnetic field perturbations worldwide.

Figure 2. Heidke Skill Score (HSS) of GeoDGP with prediction lead time ${T}_{S}$ for individual magnetometer stations. The top and bottom panels show the HSS with thresholds of 50 nT (top) and 200 nT (bottom), respectively. The evaluation included 22 geomagnetic storms in 2015. The black solid curves represent the $50^{\circ}$ magnetic latitude gridlines that separate the low- and mid-latitude region from the high-latitude region. The shaded areas represent the $60^{\circ}$-$70^{\circ}$ auroral-latitude region.

Figure 3. Heidke Skill Score (HSS) of GeoDGP (red) and Geospace (blue) for individual magnetometer stations by magnetic latitude. The evaluation included 22 geomagnetic storms in 2015. Both models have lead time ${T}_{S}$. The four panels plot HSS under thresholds of 50, 200, 300, and 400 nT. The shaded areas represent the auroral-latitude region between $60^{\circ}$ and $70^{\circ}$. The dashed line indicates the $50^{\circ}$ magnetic latitude, which separates the high-latitude region from the low- and mid-latitude region.

Predicting $\mathrm{d}{B}_{N}$ and $\mathrm{d}{B}_{E}$ is more challenging due to the added difficulty in determining the correct sign of these components. The regional comparisons show that both models have higher MAE values in predicting $\mathrm{d}{B}_{N}$ compared to $\mathrm{d}{B}_{H}$, despite the fact that the magnitude of $\mathrm{d}{B}_{N}$ is always smaller than that of $\mathrm{d}{B}_{H}$ per Equation 1. Table 5 shows that GeoDGP achieves an SA of 0.82 for $\mathrm{d}{B}_{N}$ in the global all-latitude case, compared to 0.71 for Geospace. With a higher SA, GeoDGP achieves a $\mathrm{d}{B}_{N}$ prediction performance closer to that of $\mathrm{d}{B}_{H}$, compared to Geospace. Specifically, across all latitude regions, GeoDGP achieves an MAE of 21 nT (compared to 30 nT for Geospace) for predicting $\mathrm{d}{B}_{N}$ and 23 nT (compared to 27 nT for Geospace) for $\mathrm{d}{B}_{H}$. While neither model does better than OP in predicting $\mathrm{d}{B}_{N}$, GeoDGP comes much closer. For $\mathrm{d}{B}_{E}$, the MAE values of both models are much smaller due to the lower average magnitude of $\mathrm{d}{B}_{E}$, while GeoDGP still outperforms Geospace. However, the SA of $\mathrm{d}{B}_{E}$ is low for both models, with GeoDGP scoring below 0.7 and Geospace below 0.6, both significantly lower than the 0.77 from OP. Accurately predicting $\mathrm{d}{B}_{E}$ thus remains a challenge. This is consistent with the general understanding of the configurations of the current systems leading to these ground magnetic perturbations. For example, the classical substorm current wedge configuration (McPherron et al., 1973) predicts dominant northward perturbations at a mid-latitude station on the night side but eastward (westward) perturbations on the west (east) part of the current wedge.

Extending the forecast lead time of GeoDGP from ${T}_{S}$ to ${T}_{S}+1\,\text{hr}$ results in decreased model performance, as expected. However, as shown in Table 5, GeoDGP with ${T}_{S}+1\,\text{hr}$ still outperforms Geospace with ${T}_{S}$ across all evaluation metrics. It also outperforms the $45\,\text{min}+1\,\text{hr}$ OP with significantly higher HSS in predicting $\mathrm{d}{B}_{H}$, lower MAE and higher SA in predicting $\mathrm{d}{B}_{N}$, and comparable metrics in predicting $\mathrm{d}{B}_{E}$. Moreover, GeoDGP with ${T}_{S}+1\,\text{hr}$ demonstrates better or similar performance compared to the 1 hr GeoDGP MP. While we explored extending GeoDGP's forecast lead time to ${T}_{S}+2\,\text{hr}$ and beyond, the performance declined significantly. Therefore, we conclude that the current GeoDGP configuration provides reliable predictions up to ${T}_{S}+1$ hr into the future, which corresponds to a forecast range of 1.5–2 hr in practice, depending on the solar wind speed.

Additionally, Table 6 shows the evaluation of the probabilistic aspect of GeoDGP using the regional median of coverage rate, average interval width, and average interval score of the prediction intervals. Figure 4 further illustrates how these measures vary with magnetic latitude for ${T}_{S}$ time ahead $\mathrm{d}{B}_{H}$ predictions. Although the empirical coverage of 98% is close to the nominal rate of 95%, we observe an uneven distribution of coverage rate across magnetic latitude, likely due to regional differences in disturbance levels. The coverage rate in the low- and mid-latitude region is near 100%, indicating overly wide prediction intervals, whereas in the auroral-latitude region, the coverage rate falls well below the nominal rate. As a result, we also find a significantly higher interval score in the auroral-latitude region compared to the rest of the high-latitude region. The interval score differs from the interval width by including a penalty term for missed observation coverage. This indicates that predictions in the auroral-latitude region remain challenging. On the other hand, the presence of spatial heterogeneity in predictive uncertainty is clearly demonstrated by the increasing average interval width from lower to higher magnetic latitudes. Table 6 also shows that the interval scores of the all-latitude predictions for ${T}_{S}$ ahead are consistently lower than those for ${T}_{S}$ + 1 hr ahead, as expected.

Table 6. Regional Median of Coverage Rate $C$ (%), Average Interval Width $W$ (nT) and Average Interval Score $S$ (nT) of GeoDGP Prediction Intervals Evaluated for 22 Geomagnetic Storms in 2015

| Region | Metric | $\mathrm{d}{B}_{H}$, $\mathrm{d}{B}_{N}$, $\mathrm{d}{B}_{E}$ at ${T}_{S}$ | $\mathrm{d}{B}_{H}$, $\mathrm{d}{B}_{N}$, $\mathrm{d}{B}_{E}$ at ${T}_{S}$ + 1 hr |
| --- | --- | --- | --- |
| Low and mid latitudes; $\vert \lambda \vert \le 50^{\circ}$ | Coverage Rate | 99%, 99%, 99% | 99%, 99%, 99% |
| | Interval Width | 131, 217, 170 | 142, 237, 174 |
| | Interval Score | 132, 217, 170 | 143, 237, 175 |
| High latitudes; $50^{\circ}\le \vert \lambda \vert \le 90^{\circ}$ | Coverage Rate | 91%, 88%, 93% | 91%, 90%, 92% |
| | Interval Width | 171, 225, 173 | 188, 246, 176 |
| | Interval Score | 409, 589, 346 | 450, 576, 378 |
| Auroral latitudes; $60^{\circ}\le \vert \lambda \vert \le 70^{\circ}$ | Coverage Rate | 81%, 83%, 91% | 82%, 84%, 90% |
| | Interval Width | 175, 229, 174 | 192, 248, 177 |
| | Interval Score | 837, 832, 424 | 832, 832, 449 |
| All latitudes; $\vert \lambda \vert \le 90^{\circ}$ | Coverage Rate | 98%, 98%, 98% | 98%, 98%, 98% |
| | Interval Width | 141, 218, 170 | 154, 239, 175 |
| | Interval Score | 185, 269, 206 | 216, 298, 216 |

  • Note. The nominal coverage rate is 95%.

Figure 4. Sample-averaged coverage rate, interval width, and interval score of the GeoDGP $\mathrm{d}{B}_{H}$ probabilistic prediction versus magnetic latitude. The prediction has a lead time of ${T}_{S}$, and the evaluation included 22 geomagnetic storms in 2015. Each point represents a station. The average coverage rate is 98% (black horizontal line in Panel 1), while the nominal rate is 95% (red horizontal line in Panel 1). The shaded areas represent the auroral-latitude region between $60{}^{\circ}$ and $70{}^{\circ}$. The dashed line indicates the $50{}^{\circ}$ magnetic latitude, which separates the high-latitude region from the low- and mid-latitude region.

As a final note in this section, we highlight that the only difference in model input between GeoDGP and Geospace is the inclusion of the Dst index, which characterizes the overall disturbance level of the magnetosphere. To demonstrate the benefit of including the Dst index for predicting $\mathrm{d}{B}_{H}$, we train a model using the same architecture as GeoDGP but without the Dst index as an input. The results, shown in Table 5, indicate that the modified model performs consistently worse than GeoDGP. However, it still outperforms the Geospace model. This finding demonstrates the value of including the Dst index as an input and illustrates the advantage of the data-driven GeoDGP approach over the physics-based first-principles Geospace model. It also aligns with prior studies (e.g., Smith et al., 2020), which demonstrate that including state-characterizing variables (e.g., the Dst index) of magnetospheric systems can improve predictive performance.

4.3 Test Set 2: Comparison Between GeoDGP and DAGGER

Test set 2 consists of two geomagnetic storms: the 2011-08-05 storm with a minimum Dst of ${-}115$ nT and the 2015-03-15 storm with a minimum Dst of ${-}234$ nT. This test set is used to compare GeoDGP with the DAGGER model, which operates with a forecast horizon of ${T}_{S}+0.5\,\text{hr}$. We show that, compared to DAGGER, GeoDGP not only extends the forecast horizon by 30 min but also achieves better performance. The evaluation covers a total of 178 stations, all located in the Northern Hemisphere with magnetic latitudes greater than $40{}^{\circ}$, since the DAGGER model has only been evaluated on this subset.

Table 7 presents the evaluation metrics for both models. GeoDGP with a ${T}_{S}+1\,\text{hr}$ lead time shows a significant advantage over DAGGER in the all-latitude global prediction of $\mathrm{d}{B}_{H}$. Specifically, GeoDGP achieves a median HSS of 0.36 with the 50 nT threshold (compared to DAGGER's 0.22), 0.49 with the 200 nT threshold (compared to DAGGER's 0.05), 0.38 with the 300 nT threshold (compared to DAGGER's 0.00), and 0.34 with the 400 nT threshold (compared to DAGGER's 0.00). These results indicate that GeoDGP with ${T}_{S}+1\,\text{hr}$ remains effective in predicting high-level disturbances, whereas DAGGER performs no better than a random forecast at high thresholds (i.e., zero or near-zero HSS values). GeoDGP also achieves a lower MAE of 64 and 69 nT (compared to DAGGER's 82 and 76 nT) in predicting $\mathrm{d}{B}_{H}$ and $\mathrm{d}{B}_{N}$, respectively, while achieving a comparable MAE of 45 nT (compared to DAGGER's 44 nT) in predicting $\mathrm{d}{B}_{E}$. Similar trends are observed in regional comparisons, where GeoDGP consistently outperforms DAGGER. Comparisons with OP reaffirm the findings from Section 4.2, showing that GeoDGP with ${T}_{S}+1\,\text{hr}$ outperforms OP. Additionally, GeoDGP either matches or exceeds the performance of the 1 hr GeoDGP MP.
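For readers less familiar with the Heidke skill score, the sketch below computes it from the 2 × 2 contingency table of threshold exceedances, using the standard formula $2(ad-bc)/[(a+c)(c+d)+(a+b)(b+d)]$. It is only an illustration: the function name and sample values are hypothetical, and any time windowing applied before thresholding in the paper's evaluation is not reproduced here.

```python
import numpy as np


def heidke_skill_score(obs, pred, threshold):
    """HSS for the binary event 'value >= threshold' (1 = perfect, 0 = no skill)."""
    obs_event = np.asarray(obs, dtype=float) >= threshold
    pred_event = np.asarray(pred, dtype=float) >= threshold
    a = np.sum(obs_event & pred_event)      # hits
    b = np.sum(~obs_event & pred_event)     # false alarms
    c = np.sum(obs_event & ~pred_event)     # misses
    d = np.sum(~obs_event & ~pred_event)    # correct negatives
    denom = (a + c) * (c + d) + (a + b) * (b + d)
    if denom == 0:
        return 0.0  # score undefined (only one class present); treated as no skill
    return float(2.0 * (a * d - b * c) / denom)


# Hypothetical dB_H values in nT.
obs = [10.0, 250.0, 300.0, 20.0]
pred_good = [5.0, 220.0, 310.0, 30.0]   # crosses 200 nT exactly when the observation does
pred_low = [5.0, 50.0, 60.0, 30.0]      # never reaches 200 nT

hss_good = heidke_skill_score(obs, pred_good, threshold=200.0)
hss_low = heidke_skill_score(obs, pred_low, threshold=200.0)
print(hss_good, hss_low)
```

A model that never predicts an exceedance has no hits and no false alarms, so the numerator vanishes and the score is zero. This is how systematic underprediction of large perturbations produces the zero HSS values at high thresholds.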

Table 7. Test Set 2: 2011-08-05 and 2015-03-15 Storms

| Region | ${\Delta }$B | Metric | DAGGER (${T}_{S}$ + 30 min) | GeoDGP (${T}_{S}$ + 1 hr) | MP (${T}_{S}$ + 1 hr) | OP (105 min) |
|---|---|---|---|---|---|---|
| Low and mid latitudes; $40{}^{\circ}\le \lambda \le 50{}^{\circ}$; 26 Stations | $\mathrm{d}{B}_{H}$ | HSS (50 nT) | 0.15 | 0.63 | 0.62 | 0.55 |
| | | HSS (100 nT) | 0.03 | 0.73 | 0.70 | 0.64 |
| | | HSS (200 nT) | 0.00 | 0.02 | 0.01 | 0.34 |
| | | MAE | 49 | 16 | 16 | 22 |
| | $\mathrm{d}{B}_{N}$ | MAE | 45 | 16 | 18 | 23 |
| | | SA | 0.83 | 0.96 | 0.95 | 0.93 |
| | $\mathrm{d}{B}_{E}$ | MAE | 21 | 20 | 18 | 19 |
| | | SA | 0.70 | 0.74 | 0.74 | 0.79 |
| High latitudes; $50{}^{\circ}\le \lambda \le 90{}^{\circ}$; 152 Stations | $\mathrm{d}{B}_{H}$ | HSS (50 nT) | 0.25 | 0.31 | 0.34 | 0.31 |
| | | HSS (200 nT) | 0.07 | 0.52 | 0.43 | 0.22 |
| | | HSS (300 nT) | 0.00 | 0.39 | 0.33 | 0.09 |
| | | HSS (400 nT) | 0.00 | 0.35 | 0.20 | 0.02 |
| | | MAE | 86 | 72 | 83 | 95 |
| | $\mathrm{d}{B}_{N}$ | MAE | 82 | 77 | 86 | 107 |
| | | SA | 0.75 | 0.79 | 0.78 | 0.72 |
| | $\mathrm{d}{B}_{E}$ | MAE | 50 | 50 | 52 | 63 |
| | | SA | 0.67 | 0.68 | 0.69 | 0.68 |
| Auroral latitudes; $60{}^{\circ}\le \lambda \le 70{}^{\circ}$; 64 Stations | $\mathrm{d}{B}_{H}$ | HSS (50 nT) | 0.32 | 0.31 | 0.36 | 0.32 |
| | | HSS (200 nT) | 0.22 | 0.51 | 0.43 | 0.20 |
| | | HSS (300 nT) | 0.18 | 0.48 | 0.40 | 0.13 |
| | | HSS (400 nT) | 0.05 | 0.50 | 0.41 | 0.06 |
| | | MAE | 98 | 88 | 96 | 118 |
| | $\mathrm{d}{B}_{N}$ | MAE | 98 | 92 | 106 | 137 |
| | | SA | 0.75 | 0.78 | 0.76 | 0.70 |
| | $\mathrm{d}{B}_{E}$ | MAE | 54 | 55 | 58 | 70 |
| | | SA | 0.62 | 0.66 | 0.67 | 0.65 |
| All latitudes; $\lambda \ge 40{}^{\circ}$; 178 Stations | $\mathrm{d}{B}_{H}$ | HSS (50 nT) | 0.22 | 0.36 | 0.39 | 0.37 |
| | | HSS (200 nT) | 0.05 | 0.49 | 0.40 | 0.23 |
| | | HSS (300 nT) | 0.00 | 0.38 | 0.32 | 0.07 |
| | | HSS (400 nT) | 0.00 | 0.34 | 0.19 | 0.02 |
| | | MAE | 82 | 64 | 72 | 87 |
| | $\mathrm{d}{B}_{N}$ | MAE | 76 | 69 | 80 | 97 |
| | | SA | 0.76 | 0.82 | 0.81 | 0.74 |
| | $\mathrm{d}{B}_{E}$ | MAE | 44 | 45 | 47 | 53 |
| | | SA | 0.67 | 0.68 | 0.69 | 0.69 |

Note. Median HSS, MAE [nT], and sign accuracy (SA) in various magnetic latitude $(\lambda )$ regions for DAGGER, GeoDGP, GeoDGP model persistence (MP), and observation persistence (OP). The corresponding lead time for each model is shown in parentheses after the model name. The model with the best performance is bolded. All stations are in the Northern Hemisphere with $\lambda \ge 40{}^{\circ}$.

To further evaluate the model's ability to capture the temporal behavior of perturbations during a geomagnetic storm, we present station-wise $\mathrm{d}{B}_{H}$ predictions for the 2011-08-05 storm in Figure 5 using the same six stations analyzed by Pulkkinen et al. (2013) (see Table 8 for station details). Due to the lack of data for station PBQ during this storm, we replace it with station MEA, which has a similar magnetic latitude and is located in a densely populated area. Figure 5 shows that while both GeoDGP and DAGGER capture the major peaks and troughs in the observations, DAGGER tends to underpredict large perturbations, whereas GeoDGP provides predictions of a similar magnitude to the observations even at a longer forecast horizon. Additionally, compared to DAGGER, the predictive mean of GeoDGP exhibits lower-amplitude oscillations superimposed on the prediction trend, with periods of several hours that generally align well with observations. Lastly, GeoDGP provides probabilistic forecasts of the perturbation values that reflect their uncertainty, whereas DAGGER does not offer this capability.


Figure 5. Test set 2: The 2011-08-05 storm. $\mathrm{d}{B}_{H}$ observations (black) and predictions using GeoDGP (red) and DAGGER (green) at six stations. GeoDGP has a lead time of ${T}_{S}+1\,\text{hr}$ and DAGGER has a lead time of ${T}_{S}+0.5\,\text{hr}$. The shaded area represents the 95% prediction intervals of GeoDGP.

Table 8. Information on the Six Magnetometer Stations Studied in Figures 5, 7, 8, and 9

| Station | IAGA code | Magnetic latitude | Magnetic longitude |
|---|---|---|---|
| Abisko | ABK | 66.00 | 114.22 |
| Newport | NEW | 54.53 | ${-}54.21$ |
| Ottawa | OTT | 55.08 | ${-}3.96$ |
| Meanook | MEA | 61.26 | ${-}52.52$ |
| Wingst | WNG | 53.88 | 94.94 |
| Yellowknife | YKC | 68.68 | ${-}59.01$ |

4.4 Test Set 3: The May 2024 Gannon Extreme Storm

The May 2024 Gannon storm was an extreme G5 geomagnetic storm caused by multiple Earth-directed coronal mass ejections and associated M- and X-class solar flares. It was the strongest geomagnetic storm in the last 20 years, with a minimum Dst of ${-}412$ nT. In this section, we showcase the advantages of GeoDGP over a well-tuned Geospace simulation and demonstrate that GeoDGP produces forecasts that are better than or comparable to OP, even under these extreme space weather conditions. Due to several data gaps in the OMNI data set during this storm, we use the ballistically propagated 1-min solar wind measurements from the NASA Advanced Composition Explorer (ACE) (Stone et al., 1998) satellite as input to GeoDGP. The Geospace simulation uses WIND (Acuña et al., 1995) satellite observations, which have ${V}_{x}$ and ${B}_{z}$ measurements similar to those from ACE but more accurate plasma density $\left({N}_{p}\right)$ measurements, as shown in Figure 6. However, since GeoDGP is trained using the OMNI data set, which largely consists of ACE data, the model may already possess the ability to adjust to ACE input. The Geospace model uses a reduced inner boundary radius of 1.9 ${R}_{E}$ (instead of the default 2.5 ${R}_{E}$), and the inner boundary density is set to ${\rho }_{\text{inner}}=64+0.22\,\text{CPCP}$ (instead of the default $28+0.1\,\text{CPCP}$), where the density is measured in units of amu ${\text{cm}}^{-3}$ and the cross polar cap potential CPCP in kV. We report evaluation metrics in Table 9 over a total of 206 stations covering both hemispheres of the Earth.


Figure 6. Test set 3: 2024-05-10 Gannon extreme storm. Time series of the SMR index, solar wind speed $\left({V}_{x}\right)$, plasma density $\left({N}_{p}\right)$, and ${B}_{x}$, ${B}_{y}$, and ${B}_{z}$. The minimum SMR index value of ${-}426.6$ nT was recorded at 2024-05-10 22:36:00 UTC. GeoDGP uses Advanced Composition Explorer (red) observations as model input and Geospace uses WIND (blue) satellite observations as model input.

Table 9. Test Set 3: 2024-05-10 Gannon Extreme Storm

| Region | ${\Delta }$B | Metric | Geospace (${T}_{S}$) | GeoDGP (${T}_{S}$) | OP (45 min) | GeoDGP (${T}_{S}$ + 1 hr) | MP (${T}_{S}$ + 1 hr) | OP (105 min) |
|---|---|---|---|---|---|---|---|---|
| Low and mid latitudes; $\vert \lambda \vert \le 50{}^{\circ}$; 83 Stations | $\mathrm{d}{B}_{H}$ | HSS (50 nT) | 0.36 | 0.86 | 0.80 | 0.78 | 0.80 | 0.76 |
| | | HSS (100 nT) | 0.62 | 0.73 | 0.76 | 0.72 | 0.71 | 0.70 |
| | | HSS (200 nT) | 0.60 | 0.81 | 0.77 | 0.80 | 0.74 | 0.69 |
| | | MAE | 55 | 35 | 34 | 36 | 44 | 47 |
| | $\mathrm{d}{B}_{N}$ | MAE | 55 | 41 | 35 | 39 | 46 | 50 |
| | | SA | 0.93 | 0.94 | 0.96 | 0.93 | 0.93 | 0.93 |
| | $\mathrm{d}{B}_{E}$ | MAE | 36 | 34 | 22 | 35 | 34 | 29 |
| | | SA | 0.53 | 0.60 | 0.86 | 0.60 | 0.62 | 0.79 |
| High latitudes; $50{}^{\circ}\le \vert \lambda \vert \le 90{}^{\circ}$; 123 Stations | $\mathrm{d}{B}_{H}$ | HSS (50 nT) | 0.45 | 0.33 | 0.56 | 0.26 | 0.24 | 0.44 |
| | | HSS (200 nT) | 0.62 | 0.71 | 0.59 | 0.62 | 0.62 | 0.46 |
| | | HSS (300 nT) | 0.51 | 0.62 | 0.50 | 0.51 | 0.49 | 0.35 |
| | | HSS (400 nT) | 0.45 | 0.55 | 0.45 | 0.37 | 0.39 | 0.27 |
| | | MAE | 149 | 125 | 157 | 146 | 159 | 188 |
| | $\mathrm{d}{B}_{N}$ | MAE | 194 | 159 | 183 | 173 | 191 | 236 |
| | | SA | 0.69 | 0.77 | 0.78 | 0.75 | 0.76 | 0.70 |
| | $\mathrm{d}{B}_{E}$ | MAE | 133 | 99 | 111 | 108 | 106 | 132 |
| | | SA | 0.60 | 0.72 | 0.75 | 0.72 | 0.73 | 0.69 |
| Auroral latitudes; $60{}^{\circ}\le \vert \lambda \vert \le 70{}^{\circ}$; 48 Stations | $\mathrm{d}{B}_{H}$ | HSS (50 nT) | 0.50 | 0.30 | 0.52 | 0.20 | 0.21 | 0.40 |
| | | HSS (200 nT) | 0.54 | 0.60 | 0.52 | 0.57 | 0.58 | 0.36 |
| | | HSS (300 nT) | 0.48 | 0.57 | 0.44 | 0.45 | 0.46 | 0.32 |
| | | HSS (400 nT) | 0.42 | 0.53 | 0.42 | 0.38 | 0.36 | 0.26 |
| | | MAE | 179 | 161 | 196 | 171 | 194 | 247 |
| | $\mathrm{d}{B}_{N}$ | MAE | 206 | 181 | 218 | 198 | 222 | 279 |
| | | SA | 0.67 | 0.75 | 0.77 | 0.72 | 0.74 | 0.69 |
| | $\mathrm{d}{B}_{E}$ | MAE | 139 | 106 | 126 | 120 | 119 | 148 |
| | | SA | 0.60 | 0.72 | 0.72 | 0.73 | 0.72 | 0.69 |
| All latitudes; $\vert \lambda \vert \le 90{}^{\circ}$; 206 Stations | $\mathrm{d}{B}_{H}$ | HSS (50 nT) | 0.38 | 0.53 | 0.64 | 0.50 | 0.48 | 0.53 |
| | | HSS (200 nT) | 0.61 | 0.74 | 0.63 | 0.66 | 0.66 | 0.50 |
| | | HSS (300 nT) | 0.48 | 0.61 | 0.51 | 0.49 | 0.47 | 0.35 |
| | | HSS (400 nT) | 0.35 | 0.48 | 0.40 | 0.33 | 0.34 | 0.20 |
| | | MAE | 108 | 91 | 103 | 102 | 112 | 133 |
| | $\mathrm{d}{B}_{N}$ | MAE | 131 | 116 | 116 | 126 | 139 | 149 |
| | | SA | 0.82 | 0.87 | 0.87 | 0.86 | 0.86 | 0.82 |
| | $\mathrm{d}{B}_{E}$ | MAE | 81 | 67 | 74 | 74 | 75 | 87 |
| | | SA | 0.58 | 0.70 | 0.79 | 0.67 | 0.70 | 0.72 |

Note. Median Heidke skill score (HSS), mean absolute error (MAE) [nT], and sign accuracy (SA) over magnetic latitude $(\lambda )$ regions for Geospace, GeoDGP, observation persistence (OP), and GeoDGP model persistence (MP). The corresponding lead time for each model is shown in parentheses after the model name. The model with the best performance is bolded.

With a forecast horizon of ${T}_{S}$, GeoDGP outperforms Geospace in the all-latitude global prediction, achieving lower median MAE values (91, 116, and 67 nT) for the three perturbation components (H, N, and E) along with higher median HSS values (0.53, 0.74, 0.61, and 0.48 for the 50, 200, 300, and 400 nT thresholds, respectively) when predicting $\mathrm{d}{B}_{H}$. Interestingly, the HSS values for both models at thresholds of 200 nT and above are higher than their counterparts in Table 5. This is likely due to the extreme intensity of the storm, during which high-level disturbances were frequently recorded even at low- and mid-latitude stations, which is uncommon for moderate storms. This intensity is also evident from the larger MAE values, which are approximately four times those in Table 5 for all three perturbation components. Despite this increase in disturbance intensity, GeoDGP consistently demonstrates strong predictive power in forecasting high-level disturbances during extreme geomagnetic storms. This conclusion is further supported by the comparisons with OP, where GeoDGP achieves higher HSS values for all thresholds of 200 nT and above and lower MAE values in predicting $\mathrm{d}{B}_{H}$, while having comparable performance in predicting $\mathrm{d}{B}_{N}$ and $\mathrm{d}{B}_{E}$. Regional comparisons also confirm the superiority of GeoDGP over Geospace and OP. We note that, although OP generally performs well at short lead times during minor storms or non-storm periods, it struggles during moderate to extreme geomagnetic storms, such as the Gannon storm. In such cases, it often fails to account for abrupt changes driven by the rapidly evolving magnetic field and solar wind conditions, leading to underestimation or delayed responses. Additionally, GeoDGP predictions with a ${T}_{S}+1\,\text{hr}$ lead time generally achieve higher HSS values in predicting $\mathrm{d}{B}_{H}$ and lower MAE across all perturbation components compared to Geospace and OP. Although GeoDGP's $\mathrm{d}{B}_{H}$ HSS is close to that of the GeoDGP 1-hr MP, it achieves lower MAE for all components.

We present station-wise predictions of the three perturbation components (H, N, and E) with a ${T}_{S}$ lead time for the six stations listed in Table 8. Figure 7 first illustrates the predictions of $\mathrm{d}{B}_{H}$. All six stations recorded $\mathrm{d}{B}_{H}$ values exceeding 1,000 nT due to the extreme intensity of the storm. While both the GeoDGP and Geospace models capture the overall trends in the time series, GeoDGP provides predictions closer to observations, while Geospace occasionally underpredicts the peaks. However, during periods when the measurements dropped to low values, GeoDGP occasionally misses the dips and performs worse than Geospace. These results align with the HSS values reported in Table 9. Figures 8 and 9 show the predictions for $\mathrm{d}{B}_{N}$ and $\mathrm{d}{B}_{E}$, respectively. These components are more challenging to predict due to the difficulty in correctly determining their signs. Table 9 shows that both models tend to miss the sign for $\mathrm{d}{B}_{E}$ more frequently than for $\mathrm{d}{B}_{N}$. Both models also tend to underpredict the peaks of $\mathrm{d}{B}_{E}$ more than those of $\mathrm{d}{B}_{N}$. These findings indicate the risk of underpredicting $\mathrm{d}{B}_{H}$ if predictions are derived indirectly via Equation 1 rather than by training a model directly on $\mathrm{d}{B}_{H}$.
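The last point can be illustrated with a short worked example. It assumes that Equation 1, which is defined earlier in the paper and not repeated in this section, is the usual horizontal magnitude of the north and east components. The numerical values are hypothetical and chosen only for illustration.

```latex
% Assumption: Equation 1 is the horizontal magnitude of the two components.
\begin{align*}
\mathrm{d}B_{H} &= \sqrt{\mathrm{d}B_{N}^{2} + \mathrm{d}B_{E}^{2}} \\[4pt]
% If both component magnitudes are underpredicted, so is the derived magnitude:
|\widehat{\mathrm{d}B}_{N}| \le |\mathrm{d}B_{N}|,\;
|\widehat{\mathrm{d}B}_{E}| \le |\mathrm{d}B_{E}|
&\;\Longrightarrow\;
\widehat{\mathrm{d}B}_{H} \le \mathrm{d}B_{H} \\[4pt]
% Hypothetical values in nT (not from the paper):
\text{observed } (300,\,400) &\;\to\; \sqrt{300^{2} + 400^{2}} = 500 \\
\text{predicted } (240,\,200) &\;\to\; \sqrt{240^{2} + 200^{2}} \approx 312
\end{align*}
```

In this example, a 20% shortfall in the north component and a 50% shortfall in the east component combine into a derived horizontal magnitude that is about 38% too small.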


Figure 7. Test set 3: 2024-05-10 Gannon extreme storm. $\mathrm{d}{B}_{H}$ observations (black) and predictions using GeoDGP (red) and Geospace (blue) at six stations. Both models have a lead time of ${T}_{S}$. The shaded area represents the 95% prediction intervals of GeoDGP.


Figure 8. Test set 3: 2024-05-10 Gannon extreme storm. $\mathrm{d}{B}_{N}$ observations (black) and predictions using GeoDGP (red) and Geospace (blue) at six stations. Both models have a lead time of ${T}_{S}$. The shaded area represents the 95% prediction intervals of GeoDGP.


Figure 9. Test set 3: 2024-05-10 Gannon extreme storm. $\mathrm{d}{B}_{E}$ observations (black) and predictions using GeoDGP (red) and Geospace (blue) at six stations. Both models have a lead time of ${T}_{S}$. The shaded area represents the 95% prediction intervals of GeoDGP.

Next, we compare GeoDGP and Geospace based on the global perturbation maps of $\mathrm{d}{B}_{H}$ at the peak time of the Gannon extreme storm, defined as the time when the SMR index reaches its minimum value of ${-}426.6$ nT, at 2024-05-10 22:36:00 UTC. The SMR index, shown in Figure 6, is a composite ring current index that can be viewed as a high spatiotemporal resolution counterpart to the Dst index (Newell & Gjerloev, 2012). It is produced at 1-min cadence instead of Dst's hourly cadence, and uses 98 low- and mid-latitude magnetometers instead of Dst's 4. The global maps at the peak time are shown in Figure 10. Magnetometer stations are denoted by circles, with station observations indicated by the circle fill color. The background contour represents the global model predictions, and the shaded areas represent the night-side of Earth.


Figure 10. Test set 3: 2024-05-10 Gannon extreme storm. Model predictions of $\mathrm{d}{B}_{H}$ (background contour) from GeoDGP (top) and Geospace (bottom) with lead time ${T}_{S}$. The maps correspond to the time of minimum SMR at 2024-05-10 22:36:00 UT. Circles represent magnetometer stations, with fill color indicating their observations. The shaded areas represent the night-side of Earth.

Qualitatively, the predictions from both GeoDGP and Geospace correlate well with station observations worldwide. However, the observations show some localized disturbances that disagree with both models' predictions, such as the small perturbations measured at a geographic latitude of approximately $80{}^{\circ}$N near northern Norway. We also observe that the night-side shows stronger observed perturbations than the day-side, which is fully expected because the auroral electrojets are stronger on the night-side than on the day-side. There is also a clear dawn-dusk asymmetry, with stronger ground magnetic perturbations seen on the dawn-side, which is likely due to the extremely large negative IMF ${B}_{\mathrm{y}}$ during this time, as shown in Figure 6. Since only a few observations are available in the high-latitude region between $45{}^{\circ}$E and $180{}^{\circ}$E longitude in the Northern Hemisphere, we include FAC data from the Active Magnetosphere and Planetary Electrodynamics Response Experiment (AMPERE) in Figure 11 for reference, alongside predictions from both models in polar projection. The AMPERE FAC shows a clear dawn-dusk asymmetry with stronger FACs on the dawn-side at this time, and both model predictions capture this asymmetry. Meanwhile, the Geospace model shows rough interhemispheric symmetry in the magnitude of predictions, while GeoDGP displays an asymmetric pattern. Although station observations in the Southern Hemisphere show smaller perturbations than those in the Northern Hemisphere and align more closely with GeoDGP predictions, we note that the Southern Hemisphere has much sparser station coverage than the Northern Hemisphere. In particular, there are almost no stations in the southern auroral zone during the peak of this storm. To quantitatively study and compare the performance of the two models in each hemisphere, we reevaluated the model predictions on test set 1 separately for both hemispheres. The results are shown in Table 10. The comparisons show that GeoDGP achieves higher HSS and lower MAE values than Geospace in both hemispheres. However, comparisons in the Southern Hemisphere may not be comprehensive due to the lack of measurements. Furthermore, the Geospace prediction appears less smooth and has more localized disturbances near both polar regions. Interestingly, some of these features in the Northern Hemisphere agree well with the station observations. For example, the dip in the perturbations observed at stations in North America, distributed around geographic latitudes of approximately $50{}^{\circ}$N, is well captured by Geospace, whereas the smoother solution from GeoDGP misses it.


Figure 11. Test set 3: 2024-05-10 Gannon extreme storm. AMPERE radial current density observations (top) and model predictions of $\mathrm{d}{B}_{H}$ (background contour) from GeoDGP (middle) and Geospace (bottom) with lead time ${T}_{S}$. The maps correspond to the time of minimum SMR at 2024-05-10 22:36:00 UT. Circles represent magnetometer stations, with fill color indicating their observations. The shaded areas represent the night-side of Earth.

Table 10. Test Set 1: 22 Geomagnetic Storms in 2015

| Region | Metric | Geospace (North) | Geospace (South) | GeoDGP (North) | GeoDGP (South) |
|---|---|---|---|---|---|
| Low and mid latitudes; $\vert \lambda \vert \le 50{}^{\circ}$; (52 N; 29 S) | HSS (50 nT) | 0.57 | 0.59 | 0.72 | 0.74 |
| | HSS (100 nT) | 0.54 | 0.58 | 0.75 | 0.74 |
| | HSS (200 nT) | 0.00 | 0.02 | 0.30 | 0.54 |
| | MAE | 15 | 15 | 11 | 10 |
| High latitudes; $50{}^{\circ}\le \vert \lambda \vert \le 90{}^{\circ}$; (126 N; 15 S) | HSS (50 nT) | 0.44 | 0.31 | 0.57 | 0.55 |
| | HSS (200 nT) | 0.31 | 0.25 | 0.56 | 0.52 |
| | HSS (300 nT) | 0.26 | 0.28 | 0.49 | 0.38 |
| | HSS (400 nT) | 0.19 | 0.28 | 0.44 | 0.29 |
| | MAE | 76 | 73 | 48 | 39 |
| Auroral latitudes; $60{}^{\circ}\le \vert \lambda \vert \le 70{}^{\circ}$; (51 N; 2 S) | HSS (50 nT) | 0.49 | 0.44 | 0.57 | 0.56 |
| | HSS (200 nT) | 0.42 | 0.30 | 0.59 | 0.48 |
| | HSS (300 nT) | 0.34 | 0.27 | 0.55 | 0.35 |
| | HSS (400 nT) | 0.27 | 0.22 | 0.53 | 0.28 |
| | MAE | 80 | 80 | 68 | 62 |
| All latitudes; $\vert \lambda \vert \le 90{}^{\circ}$; (178 N; 44 S) | HSS (50 nT) | 0.48 | 0.53 | 0.62 | 0.72 |
| | HSS (200 nT) | 0.24 | 0.21 | 0.53 | 0.53 |
| | HSS (300 nT) | 0.23 | 0.15 | 0.46 | 0.18 |
| | HSS (400 nT) | 0.19 | 0.24 | 0.44 | 0.29 |
| | MAE | 36 | 16 | 31 | 11 |

Note. Interhemispheric comparisons of model performance for $\mathrm{d}{B}_{H}$ prediction based on median HSS and MAE [nT]. Both models feature a forecast horizon of ${T}_{S}$.

4.5 Evaluation of Model Performance Based on Transformer Heating Proxy

Lastly, we use the Gannon extreme storm as an example and discuss the practical use of the GeoDGP model in mitigating the risk of GICs. While GeoDGP achieves better evaluation metrics, the model prediction may not be used directly by ground infrastructure operators for risk management. Domain- and task-specific quantities of interest, such as the impact on transformer heating, must be derived from ground magnetic field perturbations to support decision-making and help prevent another Hydro-Quebec blackout event. The quantity ${(\mathrm{d}\mathbf{B}/\mathrm{d}t)}_{H}$ is a widely used and extensively tested proxy in the community (Mac Manus et al., 2017; Pulkkinen et al., 2013; Viljanen et al., 2001). However, it remains challenging to predict due to its highly variable and often chaotic nature (Kellinsalmi et al., 2022; Tóth et al., 2014). Our current model does not reliably predict ${(\mathrm{d}\mathbf{B}/\mathrm{d}t)}_{H}$ at 1-min temporal resolution. Nevertheless, as we demonstrate in this section, the model prediction remains effective for estimating GIC-related risks. We use the transformer heating effect as an example, and show that it may not be primarily driven by the highest-frequency temporal variation in $\mathrm{d}\mathbf{B}$.

As an initial attempt to provide a more practical performance measure than ${(\mathrm{d}\mathbf{B}/\mathrm{d}t)}_{H}$, we propose to use the bi-hourly integral of the geoelectric field $\mathbf{E}$ as an estimate of the potential transformer heating effect. This measure assumes a simple power-law frequency dependence of the ground conductance and ignores the orientation and other properties of the affected power lines. The methodology is similar to that of Hu et al. (2024). Below, we derive the measure step by step.

The heating of the power grid is generated by the GICs caused by the geoelectric field $\mathbf{E}$. The geoelectric field is driven by the $\mathrm{d}\mathbf{B}$ variations and can be calculated in Fourier space. We ignore the effect of the downward component of $\mathrm{d}\mathbf{B}$. We start by calculating the fast Fourier transform of $\mathrm{d}\mathbf{B}$ at magnetometer stations. Figure 12 shows the power spectrum of $\mathrm{d}\mathbf{B}$ based on a 2-hr time interval at station ABK during the Gannon storm. The power spectra of the model predictions and the observations are very similar.


Figure 12. Left: The power spectrum of $\mathrm{d}\mathbf{B}$ in a 2-hr time interval starting from the 24th hr after 2024-05-10 at station ABK. Right: The power spectrum of the geoelectric field proxy ${E}^{\prime }$ in the same time window.

The power spectrum of the geoelectric field $\mathbf{E}$ is obtained by multiplying the Fourier transform of the $\mathrm{d}\mathbf{B}$ variations by the impedance tensor $\mathbf{Z}$ and taking the squared magnitude:
${E}^{2}(\omega )=\vert \mathbf{Z}(\omega )\cdot \mathbf{B}(\omega ){\vert }^{2}.$ (17)
For simplicity, we assume that each entry of the impedance matrix $Z$ can be approximated as a real scalar with a power-law frequency dependence $Z(\omega )\propto {\omega }^{\alpha }$. Figure 13 illustrates this approximation by showing measurements from two magnetotelluric (MT) sites (NYE56, NYE57) near magnetometer station OTT. The MT data is publicly available at https://ds.iris.edu/spud/emtf (Kelbert, 2020; Kelbert et al., 2019). Ignoring the scalar coefficient, which is independent of whether one uses the observed or predicted $\mathbf{B}(\omega )$, we define the geoelectric field proxy as
${E}^{\prime 2}(\omega )={\omega }^{2\alpha }{B}^{2}(\omega ),$ (18)
although the proportionality factor connecting it to the actual electric field varies greatly with location. Nevertheless, using the proper frequency weighting allows a proper evaluation of how the magnetic field variations contribute to the geoelectric field.
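As a minimal sketch of Equation 18, the snippet below applies the frequency weighting to the power spectrum of a single 2-hr window sampled at 1-min cadence. It assumes frequencies in Hz, $\alpha = 0.5$, and removal of the window mean before the transform. The function name and the synthetic two-tone signal are illustrative and are not taken from the paper.

```python
import numpy as np


def geoelectric_proxy_spectrum(db, dt_s=60.0, alpha=0.5):
    """Return frequencies [Hz] and the weighted power omega^(2*alpha) * |B(omega)|^2."""
    db = np.asarray(db, dtype=float)
    db = db - db.mean()                        # drop the zero-frequency offset
    freqs = np.fft.rfftfreq(db.size, d=dt_s)   # one-sided frequency axis in Hz
    power_b = np.abs(np.fft.rfft(db)) ** 2     # power spectrum of dB
    return freqs, freqs ** (2.0 * alpha) * power_b


# Synthetic 2-hr window: a strong 60-min oscillation plus a weaker 8-min oscillation.
t = np.arange(120) * 60.0
db = 200.0 * np.sin(2 * np.pi * t / 3600.0) + 100.0 * np.sin(2 * np.pi * t / 480.0)

freqs, proxy = geoelectric_proxy_spectrum(db)
peak_hz = freqs[np.argmax(proxy)]
print(f"proxy peak at {peak_hz:.4f} Hz (period {1.0 / peak_hz / 60.0:.1f} min)")
```

In this synthetic case the unweighted spectrum peaks at the slow 60-min oscillation, while the weighted proxy peaks at the 8-min oscillation, which shows how the $\omega^{2\alpha}$ factor shifts emphasis toward faster variations without letting the highest frequencies dominate.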

Figure 13. Approximating the impedance matrix $Z$ as a scalar with a power-law frequency dependence $Z(\omega )\propto {\omega }^{0.5}$. Data from two magnetotelluric sites near magnetometer station OTT are shown.

The instantaneous heating power of a conductor of length $L$ with resistance $R$ is approximately
$P(t)=U(t)I(t)=\frac{{U}^{2}(t)}{R}=\frac{{L}^{2}}{R}{E}^{2}(t).$ (19)
The time integral of $P(t)$ over a time period $T$ is the energy heating the power grid. Again, we drop the constants $L$ and $R$ (and also ignore the direction of the conductors and the distribution of resistivity) to get a simple expression for the total heating:
$H(t)=\int \nolimits_{t}^{t+T}\mathrm{d}t\,{E}^{2}(t)\propto \int \mathrm{d}\omega \,{E}^{2}(\omega )\approx \int \mathrm{d}\omega \,{\omega }^{2\alpha }{B}^{2}(\omega ).$ (20)
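The proportionality between the time integral and the frequency integral in Equation 20 follows from Parseval's theorem. For a window of $N$ samples at spacing $\Delta t$, a discrete version can be written as follows. The notation $E_{n}$ and $\hat{E}_{k}$ is introduced here only for illustration.

```latex
% Discrete Parseval identity for N samples E_n = E(t_n) with spacing \Delta t
\begin{align*}
\hat{E}_{k} &= \sum_{n=0}^{N-1} E_{n}\, e^{-2\pi i k n / N},
\qquad
\sum_{n=0}^{N-1} |E_{n}|^{2} = \frac{1}{N} \sum_{k=0}^{N-1} |\hat{E}_{k}|^{2} \\[4pt]
% Hence the time integral over the window is a constant times the summed power:
\int_{t}^{t+T} E^{2}(t')\,\mathrm{d}t'
&\approx \Delta t \sum_{n=0}^{N-1} |E_{n}|^{2}
= \frac{\Delta t}{N} \sum_{k=0}^{N-1} |\hat{E}_{k}|^{2}
\end{align*}
```

Because the factor $\Delta t / N$ is the same for observed and predicted series, dropping it does not affect the comparison between them.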

Since the characteristic time for transformer heating is about 2 hr, we set $T$ = 2 hr. We propose to use these bi-hourly integrals of the weighted power spectrum of $\mathrm{d}\mathbf{B}$ as a proxy for transformer heating effects, based on which we evaluate the accuracy of the model against observations. For the value of $\alpha $, we use 0.5 based on the MT site measurements shown in Figure 13. Figure 12 shows the power spectrum of the geoelectric field proxy ${E}^{\prime }$ at station ABK. The spectral peak of the observation is located around $\omega =0.002$ Hz, which corresponds to periods of about 8 min. This suggests that the primary contribution to the transformer heating proxy comes from lower-frequency components of the $\mathrm{d}\mathbf{B}$ variations, which are generally more predictable, and that it is not dominated by those of the highest frequency. Further investigations of stations Ottawa (OTT) and Fredericksburg (FRD) support this finding, with the periods corresponding to the spectral peak typically ranging from 5 to 10 min.
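A minimal sketch of the bi-hourly proxy in Equation 20 is given below. It assumes non-overlapping 2-hr windows of 1-min data, frequencies in Hz, and $\alpha = 0.5$. Whether the paper uses sliding or non-overlapping windows is not stated here, so the windowing, the function name, and the synthetic series are all assumptions.

```python
import numpy as np


def heating_proxy(db, dt_s=60.0, window_s=7200.0, alpha=0.5):
    """Sum of omega^(2*alpha) * |B(omega)|^2 over each non-overlapping window."""
    db = np.asarray(db, dtype=float)
    n_win = int(round(window_s / dt_s))          # samples per window
    n_windows = db.size // n_win                 # trailing partial window is ignored
    weights = np.fft.rfftfreq(n_win, d=dt_s) ** (2.0 * alpha)
    out = np.empty(n_windows)
    for i in range(n_windows):
        seg = db[i * n_win:(i + 1) * n_win]
        power = np.abs(np.fft.rfft(seg - seg.mean())) ** 2
        out[i] = np.sum(weights * power)
    return out


# Synthetic 4-hr series: a quiet 2-hr window followed by a 10x more active one.
t = np.arange(240) * 60.0
amplitude = np.where(t < 7200.0, 10.0, 100.0)
db = amplitude * np.sin(2 * np.pi * t / 480.0)

proxy = heating_proxy(db)
print(proxy)
```

Because the proxy is quadratic in the perturbation, a tenfold increase in amplitude raises it by a factor of one hundred. This is one reason amplitude errors in the predictions translate into large errors in the heating estimate even when the timing is captured well.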

Comparing this quantity between model predictions and observations provides a reasonable estimate of the relative errors in the resulting transformer heating. Figure 14 shows the integral for three stations: ABK at $66{}^{\circ}$ magnetic latitude, OTT at $55{}^{\circ}$, and FRD at $48{}^{\circ}$. An advantage of GeoDGP's probabilistic predictions is that the uncertainty, manifested through multiple realizations of the time series, can be propagated to downstream tasks to support decision-making. We carry out the uncertainty propagation using Monte Carlo sampling and construct the prediction intervals for the transformer heating proxy. Specifically, for each station, we draw 200 samples from the posterior predictive distribution, use the sample mean to obtain a point estimate, and use the 2.5th and 97.5th sample quantiles as the endpoints of the prediction intervals. One artifact we observed in individual realizations, but not in the mean prediction of GeoDGP, is artificial noise at the 1-min scale, caused by the use of the Matérn-1/2 kernel, which places a non-differentiable prior on function values. We apply a boxcar filter of size 3 to the 1-min cadence time series to smooth the realizations so that the resulting sample integral proxy is not dominated by high-frequency artifacts. From Figure 14, we find that both models have significant errors in the heating amplitudes compared to the values derived from observations, and similarly the prediction intervals of GeoDGP fail to achieve the nominal coverage rate. On the other hand, both models capture the timing of heating incidents quite well, suggesting their potential practical use for power grid operators. It is reassuring to see that the large regional differences in both magnitude and timing are well captured by both models.
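The Monte Carlo procedure described above can be sketched as follows. Since the GeoDGP posterior sampler is not reproduced here, the 200 realizations are replaced by a stand-in (a sinusoid plus independent Gaussian noise), which is purely an assumption for illustration. The function names are hypothetical, and the proxy uses $\alpha = 0.5$ with frequencies in Hz.

```python
import numpy as np


def boxcar_smooth(x, size=3):
    """Moving average of the given size (edges are zero-padded by mode='same')."""
    return np.convolve(x, np.ones(size) / size, mode="same")


def heating_proxy_single(db, dt_s=60.0, alpha=0.5):
    """Weighted power sum for one window: sum of omega^(2*alpha) * |B(omega)|^2."""
    freqs = np.fft.rfftfreq(db.size, d=dt_s)
    power = np.abs(np.fft.rfft(db - db.mean())) ** 2
    return float(np.sum(freqs ** (2.0 * alpha) * power))


def propagate(samples, level=0.95):
    """Point estimate (sample mean) and quantile interval of the proxy over realizations."""
    values = np.array([heating_proxy_single(boxcar_smooth(s)) for s in samples])
    lower, upper = np.quantile(values, [(1 - level) / 2, (1 + level) / 2])
    return values.mean(), lower, upper


# Stand-in for 200 posterior predictive draws over one 2-hr window (NOT GeoDGP output).
rng = np.random.default_rng(0)
t = np.arange(120) * 60.0
mean_path = 100.0 * np.sin(2 * np.pi * t / 480.0)
samples = mean_path + rng.normal(scale=20.0, size=(200, t.size))

point, lower, upper = propagate(samples)
print(f"proxy = {point:.3g}, 95% interval = [{lower:.3g}, {upper:.3g}]")
```

Smoothing each realization before computing the proxy matters because the proxy is quadratic: unfiltered 1-min noise would add a positive contribution to every sample and bias both the point estimate and the interval upward.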


Figure 14. Using the fast Fourier transform to evaluate the potential impact of $\mathrm{d}\mathbf{B}$ variations on transformer heating during the Gannon extreme storm. A comparison of the proxy quantity derived from the GeoDGP prediction (red), the Geospace prediction (blue), and observations from three magnetometer stations (black) is shown. The shaded area represents the 95% prediction interval of GeoDGP. Both GeoDGP and Geospace have a prediction lead time of ${T}_{S}$.

5 Conclusions

In this study, we developed GeoDGP, a global, grid-free probabilistic forecasting model for predicting the north ($\mathrm{d}{B}_{N}$), east ($\mathrm{d}{B}_{E}$), and horizontal ($\mathrm{d}{B}_{H}$) components of ground magnetic field perturbations at 1-min time cadence. The model provides predictions with lead times of ${T}_{S}$ and ${T}_{S}+1\,\text{hr}$. To evaluate its performance, we conducted statistical analysis using 24 geomagnetic storms that span a wide range of space weather conditions, from moderate to extreme. The evaluation also included observations from over 200 magnetometer stations across the globe.

Evaluation on the 22 geomagnetic storms in 2015 showed that GeoDGP consistently outperforms the first-principles Geospace model under both lead times. Notably, GeoDGP with a $T_S$ lead time, the same as that of Geospace, achieved median HSS values of 0.63, 0.53, 0.44, and 0.41 for the 50, 200, 300, and 400 nT thresholds in predicting $\mathrm{d}B_H$; a median SA of 0.82 in predicting $\mathrm{d}B_N$ and 0.67 in predicting $\mathrm{d}B_E$; and median MAE values of 23, 21, and 18 nT in predicting $\mathrm{d}B_H$, $\mathrm{d}B_N$, and $\mathrm{d}B_E$, respectively. The model also captures spatial heterogeneity in predictive uncertainty, accounting for regional differences in disturbance levels. A separate evaluation on two selected storms, aimed at comparing with the DAGGER model, showed that GeoDGP not only extends the forecast horizon from DAGGER's $T_S + 0.5\,\text{hr}$ to $T_S + 1\,\text{hr}$ but also maintains significantly better storm-period HSS and MAE. The detailed evaluation on the 2024-05-10 Gannon extreme storm demonstrated the consistent performance of GeoDGP under extreme space weather conditions. In all evaluation cases, GeoDGP with both lead times achieved performance better than or similar to that of OP in predicting $\mathrm{d}B_H$, highlighted by high HSS values in predicting high-level disturbances.
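For readers who want to reproduce threshold-based scores of the kind quoted above, the following Python sketch computes the Heidke skill score (HSS) from a 2 × 2 contingency table, together with the mean absolute error (MAE). It is an illustration under stated assumptions rather than the evaluation code of this study: the function names are hypothetical, and the threshold is applied point by point, so any windowing of the time series used in the actual evaluation is assumed to have been applied before these functions are called.

```python
import numpy as np


def heidke_skill_score(obs, pred, threshold):
    """HSS for the event 'value >= threshold' (standard 2x2 formula)."""
    obs_event = np.asarray(obs) >= threshold
    pred_event = np.asarray(pred) >= threshold

    hits = float(np.sum(obs_event & pred_event))
    false_alarms = float(np.sum(~obs_event & pred_event))
    misses = float(np.sum(obs_event & ~pred_event))
    true_negatives = float(np.sum(~obs_event & ~pred_event))

    numerator = 2.0 * (hits * true_negatives - false_alarms * misses)
    denominator = (
        (hits + misses) * (misses + true_negatives)
        + (hits + false_alarms) * (false_alarms + true_negatives)
    )
    # The score is undefined when every sample falls in a single cell.
    return numerator / denominator if denominator > 0 else float("nan")


def mean_absolute_error(obs, pred):
    """Mean absolute difference between observation and prediction."""
    return float(np.mean(np.abs(np.asarray(obs) - np.asarray(pred))))
```

An HSS of 1 corresponds to a perfect forecast of threshold crossings, 0 to no skill beyond random chance, and negative values to forecasts worse than chance.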

This paper introduced several novel elements and key contributions. (a) We encoded the location of magnetometer stations in the SM coordinate system, which naturally incorporates diurnal variations. This allowed us to jointly model the data observed from different stations at different times and improved the observational coverage of Earth's surface. (b) We modeled the non-stationarity of storm-period data using a combination of regular and spectral kernels in the hidden layers of the DGP model, achieving significant improvement in predicting high-level disturbances compared to existing methods. (c) We provided predictive uncertainty along with forecasts, leveraging the probabilistic capabilities of the DGP model. (d) We extended GeoDGP's forecast horizon to 1 hr plus the L1 propagation time while still achieving better performance than existing models. (e) GeoDGP is computationally efficient to evaluate. For the full grid with $180$ (latitude) $\times$ $360$ (longitude) resolution, the model takes an average evaluation time of 4 s on an NVIDIA RTX 4090 GPU (24 GB VRAM) in a system with an Intel Core i9-13900K CPU and 64 GB RAM. This suggests that GeoDGP is suitable for operational deployment and real-time space weather applications. A website (https://csem.engin.umich.edu/GeoDGP) has been developed to provide real-time predictions of ground magnetic field perturbations.
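To make the size of the full grid in (e) concrete, the sketch below builds a $180 \times 360$ set of query locations in Python. It is illustrative only: the function name `build_query_grid` is hypothetical, the cell-centered layout at 1° spacing is an assumption, and the remaining model inputs are omitted.

```python
import numpy as np


def build_query_grid(n_lat=180, n_lon=360):
    """Return an (n_lat * n_lon, 2) array of (latitude, longitude) pairs.

    Points sit at cell centers (an assumption): latitudes from -89.5 to
    89.5 degrees and longitudes from 0.5 to 359.5 degrees by default.
    """
    lat = -90.0 + (np.arange(n_lat) + 0.5) * (180.0 / n_lat)
    lon = (np.arange(n_lon) + 0.5) * (360.0 / n_lon)
    lat_grid, lon_grid = np.meshgrid(lat, lon, indexing="ij")
    return np.column_stack([lat_grid.ravel(), lon_grid.ravel()])


grid = build_query_grid()
print(grid.shape)  # one row per query location
```

Because the model is grid-free, any other set of locations could be passed in place of this regular grid; the default above yields 64,800 query points per evaluation.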

Despite its strong performance, there remain limitations to the GeoDGP model that warrant future work. First, predictions in the auroral zone remain challenging. The model tends to underpredict peaks in $\mathrm{d}B_E$ and overpredict troughs in $\mathrm{d}B_H$, while the empirical coverage of the three components is significantly lower than the nominal rate of 95%. Second, since we only encode location in the SM coordinate system, the model does not account for localized effects of ground conductivity. This exclusion overlooks internal sources of magnetic perturbations arising from currents induced in the solid Earth (Tanskanen et al., 2001). As a result, the model may fail to capture localized enhancements that are operationally significant and have been emphasized in the TPL-007 standard via the supplemental waveform and in reports by the Electric Power Research Institute. While the model supports flexible spatial querying and implies high nominal spatial resolution, its effective resolution is limited by the smoothness of the predictions, especially in comparison to the physics-based Geospace model, which resolves finer mesoscale structures. Nonetheless, the smoothness of the GeoDGP predictions serves as a strength in capturing broad spatial trends, making the model well suited for forecasting regional-scale ground magnetic field perturbations while avoiding overfitting to localized variability. One possible mitigation for the model's resolution limitations is to indirectly encode ground conductivity by including the geographic coordinates of observation sites as additional input. However, extrapolation remains challenging due to the currently sparse and uneven observational coverage. A more direct alternative would involve coupling the model with a global ground conductivity model by incorporating features derived from it. Therefore, a thorough investigation into the interplay between model hyperparameters, local geological encoding, and the resulting effective spatial resolution would be a valuable direction for future work. Third, the model performance in the Southern Hemisphere has not been fully evaluated due to the sparsity of measurements. Future efforts aimed at enhancing model accuracy and expanding data sets that target these difficult regions would be highly beneficial. Additionally, tailoring the model to better align with operational practice, such as coupling with models of subsurface conductivity and power networks (e.g., Mac Manus et al., 2022), refining the estimation of downstream quantities of interest, and propagating uncertainty in model predictions to support risk-based decision making, would be another valuable direction for future work.
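The comparison between empirical and nominal coverage mentioned above can be checked with a short calculation. The Python sketch below is a minimal illustration, with the hypothetical function name `empirical_coverage`; it assumes the lower and upper interval endpoints have already been computed for each observation.

```python
import numpy as np


def empirical_coverage(obs, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    obs, lower, upper = (np.asarray(a, dtype=float) for a in (obs, lower, upper))
    inside = (obs >= lower) & (obs <= upper)
    return float(inside.mean())


# Toy example: two of four observations fall inside their intervals.
coverage = empirical_coverage(
    obs=[1.0, 2.0, 3.0, 4.0],
    lower=[0.0, 0.0, 0.0, 5.0],
    upper=[2.0, 2.0, 2.0, 6.0],
)
print(coverage)  # 0.5
```

For a well-calibrated 95% prediction interval, this fraction should be close to 0.95; values well below that indicate intervals that are too narrow.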

Acknowledgments

This work is supported by the National Science Foundation under Grant PHY-2027555: “SWQU: NextGen SWMF Using Data, Physics and Uncertainty Quantification.” We acknowledge use of NASA/GSFC's Space Physics Data Facility (SPDF) OMNIWeb and CDAWeb services (NASA Space Physics Data Facility (SPDF), 2024), and OMNI data (King & Papitashvili, 2020; Papitashvili & King, 2020). We gratefully acknowledge the SuperMAG collaborators (Gjerloev, 2012; Newell & Gjerloev, 2012). The Dst data are provided by the WDC for Geomagnetism, Kyoto and are available through Nose et al. (2015). The magnetotelluric (MT) data (Kelbert, 2020; Kelbert et al., 2019) are publicly available through Kelbert et al. (2011). The model is implemented using the Python package GPyTorch (Gardner et al., 2018). The SWMF source code is publicly available through Gombosi et al. (2021).

Data Availability Statement

The scripts and routines used to produce the results in this manuscript are available in the University of Michigan Library Deep Blue Data Repository at Chen et al. (2024).