A Deep Learning‐Based Methodology for Precipitation Nowcasting With Radar

Nowcasting and early warning of severe convective weather play crucial roles in heavy rainfall warning, flood mitigation, and water resource management. However, achieving effective temporal‐spatial resolution nowcasting is a very challenging task owing to the complex dynamics and chaos. Recently, an increasing amount of research has focused on utilizing deep learning approaches for this task because of their powerful abilities in learning spatiotemporal feature representation in an end‐to‐end manner. In this paper, we present convolutional long short‐term memory with a layer called star‐shape bridge to transfer features across time steps. We build an end‐to‐end trainable model for the nowcasting problem using the radar echo data set. Furthermore, we propose a raining‐oriented loss function inspired by the critical success index and utilize the group normalization technique to refine the convergence performance in optimizing our deep network. Experiments indicate that our model outperforms convolutional long short‐term memory with the cross entropy loss function and the conventional extrapolation method.


Introduction
Precipitation nowcasting refers to providing precise and timely prediction of rainfall intensity over a local region in the near future. It is important in daily life because it can generate society-level emergency rainfall alerts and produce weather guidance for airports. Because of the high accuracy and spatiotemporal resolution requirements, precipitation nowcasting is quite challenging and has emerged as a hot research topic in the meteorological community (Sun et al., 2014).
Existing methods for precipitation nowcasting can roughly be categorized into two classes: numerical weather prediction (NWP)-based methods (Sun et al., 2014;Weisman et al., 2008) and radar echo extrapolation-based methods (Han et al., 2009;Li et al., 1995;Rinehart & Gravey, 1978). The NWP methods yield predictions by numerically solving partial differential equations subject to dynamic and thermodynamic laws. However, the inner mechanisms that underpin atmospheric behavior, which is regarded as a complex spatiotemporal dynamical system, are not fully understood. Hence, NWP-based methods for precipitation nowcasting have essential limitations. Meanwhile, some conventional extrapolation methods, such as COTREC (Mecklenburg et al., 2000;Rinehart & Gravey, 1978), adopt computer vision techniques for precipitation nowcasting. Although the extrapolation methods are effective for reliably predicting radar echo, they lack a strong nonlinear mapping ability to extract inherent features from large amounts of data.
To address these issues, deep learning-based methods utilize large-scale historical data to learn a collection of feature representations for precise prediction (Ranzato et al., 2014;Shi et al., 2015;Zheng et al., 2015). By formulating precipitation nowcasting as a spatiotemporal sequence forecasting problem, that is, predicting the next consecutive frame sequence subsequent to the previous observation sequence (Shi et al., 2015), the pioneering long short-term memory (LSTM) framework proposed in (Sutskever et al., 2014) provides a general deep learning framework for sequence-to-sequence predictive problems by concatenating two LSTMs: one for the input sequence and another for the output sequence. However, the LSTM predictive learning methods focus more on modeling temporal variations, such as object motion. To effectively model spatiotemporal information, Shi et al. (2015) extended the LSTM structure to the convolutional long short-term memory (ConvLSTM) model by exploiting convolutions in both input-to-state and state-to-output transitions. Despite their seminal effort, there exist some deficiencies. Although better than the conventional extrapolation methods used in capturing spatiotemporal correlations, first, the ConvLSTM can be further optimized. Second, their model is optimized by assuming cross entropy loss as a reasonable proxy index of critical success index (CSI), which is widely used in precipitation nowcasting. Finally, the radar echo images used in their work have 100 × 100 pixels, which may not be representative of the dimensionalities in operational forecast systems. When working with high resolution radar echo data sets of 500 × 500 pixels in our study, inevitable small batches and limited GPU memory will degrade the convergence performance of the model.
In this paper, we aim to address these three problems. We introduce a new ConvLSTM with star-shaped bridge architecture to transfer information across time steps (Cao et al., 2019). We also discuss group normalization (GN), which improves the performance of the optimization process. Inspired by the CSI, we employ a special multisigmoid loss that is regarded as the differentiable CSI. Our loss function is suitable for optimization based on stochastic gradient descent.
We compare the results to other baselines, including COTREC and the ConvLSTM with cross entropy as the loss function. It is shown that we achieve considerable performance improvement in comparison with the other models.

Data
The radar echo data set employed in this paper includes composite reflectivity data collected from the dual polarization Weather Surveillance Radar-1988 Doppler Radar (WSR-88D) located in Pudong, Shanghai, as indicated in Figure 1. The data set contains 170,000 echo images generated by volume scans in intervals 10.1029/2019EA000812 of approximately 6 min from October 2015 to July 2018. The echo data are interpolated to 0.01 • × 0.01 • longitude-latitude grids from the radar polar coordinates. To reduce the noise, we further remove the pixels with correlation coefficients of less than 0.85 (Kumjian et al., 2010). We also filter out some nonconsecutive frames due to radar faults and maintenance and apply a stride-1 sliding window to generate 76,779 length-20 sequence samples.
The training set and the testing set contain 44,099 and 32,700 sequence samples, respectively. We rescale the logarithmic reflectivity R to the interval [0, 1] by settingR = R 70 .

COTREC
COTREC, the continuous TREC (tracking radar echo by correlation) method, has been well described in several papers (Li et al., 1995;Mecklenburg et al., 2000;NOVAK, 2007). Due to space considerations, here, we only provide a brief review of the methodology. The TREC method is applied to two successive radar echo images. The first and second images (at times T 1 and T 2 = T 1 +Δt) are divided into a number of equally sized boxes as two-dimensional pixel arrays. The arrays of reflectivity in the first image will then be cross correlated with the arrays in the second image. The correlation coefficient R is used as an objective criterion for the agreement between the patterns of investigated arrays and is calculated using the following formula: where Z 1 and Z 2 are the pixel arrays of reflectivity at T 1 and T 2 , respectively. The summation is carried out for all pixels in the array, and N is the total number of pixels in the corresponding array. The computation can be repeated for all possible arrays of the second image within a search area centered on the arrays of the first image. The center of the array of the second image with the highest correlation will represent the endpoints of motion vectors (i.e., TREC vectors).
Before the whole vector field is used, each motion vector is corrected for consistency. A three-step procedure is applied: 1. Vectors with zero velocity are replaced by average vectors of the neighboring vectors. If a vector deviates by more than 25 • from the local mean of its neighborhood, then it is also replaced by the mean vector. 2. To produce the gridded vector field, a two-pass objective analysis is conducted with the Cressman weighting function. 3. The third step uses a variational technique (COTREC) (Bertsekas, 1982), forcing the vector field to zero divergence when close to the original vector field. The variational analysis can be expressed as follows: where u o and v o are the x and components of the original vectors, and u and v are their corresponding solutions.
Temporal extrapolation makes use of the semi-Lagrangian scheme (Germann & Zawadzki, 2001). Obtaining the patterns of echo growth or decay is straightforward: The difference in the echo intensity along the flow between two successive radar images is deduced by linear extrapolation. Additionally, to prevent unrealistic growth, growth will be terminated after three prediction steps or when the tracked patch is stronger than 50 dBZ. In the case of decay, the extrapolation will also be stopped after three prediction steps or when the patch is weaker than 20 dBZ. Although the process of intensity extrapolation is empirical, we realize some improvement relative to COTREC by focusing only on movement.

Group Normalization
Batch normalization (BN) (Ioffe & Szegedy, 2015) plays an important role in deep learning. BN computes the mean and variance for the samples of one batch to normalize feature maps. Despite its great success, BN exhibits drawbacks that are also caused by its distinct behavior of normalizing along the batch dimension. A small batch size leads to inaccurate estimation of batch statistics. This limits BN's usage for dealing with high resolution images (e.g., radar echo images), which requires small batches constrained by GPU memory consumption.
Recently, Wu and He (2019) proposed a novel form of normalization named group normalization (GN). Distinct from BN, GN normalizes features by grouping feature channels into small groups and computes the mean and variance for each small group. In this way, GN can normalize the feature tensors within each single sample in the batch. As a result, the normalization can be independent of the batch size. For the task of precipitation nowcasting using radar echo images, we discover the effectiveness of using GN as the normalization layer. We give a brief description about GN in the following.
In the case of 2D images, i = (i N , i C , i H , i W ) is the index of 4D tensor features in (N, C, H, W) order, where N is the batch axis, C is the channel axis, and H × W are the numbers of grids in the height and width directions ( Figure 2). GN performs the following computation: where and are trainable scale and shift, x is the feature computed by a layer, and i is the index. C is the number of channels, G is the number of groups and q is the number of channels per group. ⌊·⌋ is the floor operation, which means that r and t are in the same group if their indices along the C axis are equal in the sense of division.

ConvLSTM With Star-Shaped Bridge
Suppose that we observe a dynamical system of K measurements (e.g., radar echo images) in a time (e.g., radar scan cycle) over a spatial region with an M × N grid. Thus, the observation of these K measurements at any time can be represented by a tensor  ∈ R K×M×N . Define the K-length observation as  1 ,  2 , … ,  K . The spatiotemporal predictive learning problem is to predict the most likely L-length sequence in the future given the previous K-length observation. The goal can be formulated as follows: Compared with the standard LSTM, ConvLSTM (Shi et al., 2015) is able to simultaneously model the spatiotemporal structures by encoding the spatial information into tensors. In ConvLSTM, all inputs  t (t ∈ [1, … , K]), cell outputs  t , hidden states  t , and gates i t , t , g t , and o t are 3-D tensors. ConvLSTM determines the future state of a certain grid according to the inputs and past states of its local neighbors in the encoder-decoder architecture (see Figure 3): where = 1 1+exp(−x) is the sigmoid activation function, and * and ⊙ denote the convolution operator and the Hadamard product, respectively. If the states are viewed as hidden representations of moving objects, then ConvLSTM with a larger transitional kernel should be able to capture faster motions, while ConvLSTM with a smaller kernel can capture slower motions (Shi et al., 2015). The input gate i t , forget gate t , output gate o t , and input modulation gate g t control information flow across the memory cell  t .
In the conventional ConvLSTM architecture, as illustrated in Figure 3, the information is conveyed only upward and is updated horizontally. In our nowcasting task, the information from representations extracted at previous different level convolutional layers may be useful. In this paper, we introduce a novel transition  path named star-shaped bridge (SB) (Cao et al., 2019). SB adds more information from the last time step to make the feature flow in multilayer ConvLSTM more robust. Concretely, we concatenate outputs of all ConvLSTM layers and pass them to a convolution layer with kernel size 1 × 1 and then split the output to all ConvLSTM layers of the next time step using a residual connection (Figure 4). Throughout this paper, ConvLSTM with SB will be referred to as ConvLSTM for clarity.

Multisigmoid Loss
The CSI, which describes the ability to predict the occurrence above a given threshold (e.g., 20 dBZ), is widely applied in the meteorological community. CSI is defined as CSI = TP TP+FN+FP . For each pixel, we define terms as follows: TP = hits, FN = misses, and FP = false alarms. However, CSI is not a suitable loss function for gradient descent optimization due to its nondifferentiability. Shi et al. (2015) adopted cross entropy loss (CE), defined as where Î k is the scaled echo intensity of the observation at the pixel indexed k and I k is the predicted echo intensity. However, CE can only be regarded as a proxy index of CSI because Î k and Î k are not true probabilities, though the values are between 0 and 1. To overcome this drawback, we employ sigmoid loss L i to simulate CSI at threshold i: where c is the scale factor, a hyperparameter used to control the slope of the sigmoid function . In this work, we adopt i ∈ {20, 30}. Further, the multisigmoid loss is defined as

Experiment
In this section, we evaluate the performance of our model with respect to the radar echo data set. We also compare our model with some baselines, including COTREC and ConvLSTM with CE loss.

Setup of the Experiment
In the case of COTREC, the box arrays are set to have a fixed size of 19 × 19 pixels, the centers of these pixel arrays are spaced 5 pixels apart, and the entire COTREC vector field has a resolution of 99 × 99. The search radius is 11 pixels.
All of our deep learning-based experiments are implemented with PyTorch and conducted on four NVIDIA GTX 1080Ti GPUs. We train all models with the ADMA optimizer, and the learning rate is 0.002. The batch size is set to 4, and we employ optimization iterations of up to 60 epochs at most. Our network includes two-layer ConvLSTM, with each layer containing 128 hidden states and a 3 × 3 filter size. For GN, we set C to 128 and G to 4 for all convolution layers. We use the training set as input during parameter adjustment and Under the condition of the deep learning-based method, the contributions of GN and multisigmoid loss are investigated step-by-step. Additionally, we test the influence of different scale factors c of multisigmoid loss for robustness.

Results
All results are shown in Tables 1 and 2 and Figure 5. We find that the ConvLSTM without GN achieves some performance improvement. However, the average CSI30 over 30 min falls by 7.05%, which may be caused by statistical instability of batch estimation. When working with BN, ConvLSTM significantly outperforms COTREC for all investigated indices. The ConvLSTM with GN provides 24.5% and 33.1% improvements over the average CSI20 predicted during 30 and 60 min. It also raises the CSI30 (30 min) and CSI30 (60 min) by 9.29% and 33.5%. For investigating the influence of different scale factor c values in SSL loss, we carry out three experiments with SSL 10 , SSL 20 and SSL 30 as loss functions. As shown in Tables 1 and 2, c = 10 and c = 20 are appropriate, while 20 is the best choice. c = 30 is not as effective for our task because of the limited performance for CSI30. The performance degradation on the large scale could be attributed to the resultant gradient, which is too steep to search for a stable descent direction. The corresponding frame-by-frame quantitative comparisons are also presented in Figure 5. It can be seen from Figure 5 that our model produces the best result over all time steps. On the other hand, the contribution of GN is most obviously shown by the comparison between ConvLSTM+CE and ConvLSTM+CE+GN. In Figure 5, the results with the 30 dBZ threshold, COTREC is slightly better than ConvLSTM+CE and ConvLSTM+CE+GN at the first time step. The most probable explanation for this is that COTREC tries to include all details at small scales, which typically have short time step lifetimes and rapidly decorrelate, while effective prediction for small scale features with a long leading time is beyond the information capacity supplied by only echo images. Our model learned to focus on features at larger scales while ignoring the small scales, and this learned strategy slightly sacrifices the accuracy in the first one or two time steps to achieve better overall performance.  For comparison, a recent predicted case of COTREC and our model is shown in Figure 6. The three areas in the boxes are investigated. In Area1, our model achieves more reasonable intensity prediction and can more accurately infer the contour, especially at the boundary for the 60 min prediction. For Area2, COTREC gives a prediction with a northward bias and maintains the intensity at 60 min. Our model provides more accurate direction forecasting and underestimates the intensity in Area2. However, the level of underestimation is acceptable from the perspective of misses with the threshold of 30 dBZ. For Area3, notable differences exist. As for area1, our model offers reliable shape prediction and acceptable underestimated intensity with regard to the threshold of 30 dBZ. It is obvious that COTREC incorrectly estimates the echo movement speed 10.1029/2019EA000812 and shape compared to the observation. The limited performance of COTREC in Area1 and Area3 leads to serious misses and false alarms, which ultimately degrade the overall prediction performance. Note that the blurring effect of our model may be caused by the inherent uncertainties of the task. For long-term prediction, our model tends to blur the predictions to alleviate the error after the training process. Although the deep learning model performs better than COTREC for this case, it is worth noting that radar echoes during 2019 are not included in the test data set, and a robust conclusion should be drawn from a comparison conducted on the data set of 2019 in future work.

Summary and Discussion
In this paper, we introduce a novel ConvLSTM network with the star-shaped bridge layer for passing information. Information can be transferred both horizontally and vertically. Owing to the nondifferentiable nature of CSI, which prevents it from serving as a loss function for the gradient descent method, we employ a special multisigmoid loss function as a simulation of CSI. Additionally, we discuss the benefits of group normalization. The result shows that our model achieves state-of-the-art performance. The contributions of this paper are stated as follows: 1. We present the ConvLSTM with star-shaped bridge for precipitation nowcasting. 2. We introduce the GN technology to refine the convergence performance in optimization for ConvLSTM. 3. We employ a special loss function that is more in line with CSI.
It should be noted that GN offers the best performance improvement relative to COTREC due to its capability of stably estimating batch statistics; meanwhile, the special loss function is responsible for the improvement to some extent. Although COTREC achieves better performance with respect to intensity in some specific areas, the obvious deviations in the direction and shape of the forecast degrade the overall performance. Our model achieves a more reliable forecast with an acceptable underestimation level in terms of a proper threshold.
It should be noted that our model suffers from a blurring effect, especially for longer future time steps. One possible interpretation is that there exists an intrinsic predictability property of nonlinear systems using the available data sources. The more blurred the forecasted images are, the more the uncertainty of the system will grow with increasing lead time. To further leverage the ability of the complex deep learning-based model to obtain sharper and more accurate predictions, combining it with physical knowledge from the numerical weather forecast model and utilizing multimodal data, such as satellite data and ground weather observations, merit further exploration.