Optimization of Deep Learning Precipitation Models Using Categorical Binary Metrics

This work introduces a methodology for optimizing neural network models using a combination of continuous and categorical binary indices in the context of precipitation forecasting. Probability of detection and false alarm rate are popular metrics used in the verification of precipitation models. However, machine learning models trained using gradient descent cannot be optimized based on these metrics, as they are not differentiable. We propose an alternative formulation of these categorical indices that is differentiable, and we demonstrate how it can be used to optimize the skill of precipitation neural network models, defined as a multiobjective optimization problem. To our knowledge, this is the first proposal of a methodology for optimizing weather neural network models based on categorical indices.


Introduction
It is increasingly common in meteorology to use machine learning approaches for identifying patterns in the atmosphere using large amounts of historical data (Dueben & Bauer, 2018; Scher & Messori, 2019; Ukkonen & Mäkelä, 2019; Weyn et al., 2019). This approach, of extracting the underlying physical relationships in the atmosphere from data, opens an opportunity to explore new algorithms that optimize the output based on different verification metrics. In this work, we propose a methodology to train neural network (NN) precipitation models using a loss function that combines continuous and binary, or dichotomous [yes/no], metrics.
Verification of dichotomous events has been extensively explored in the context of weather forecasting (Casati et al., 2008; Ebert et al., 2013; Jolliffe & Stephenson, 2012; Stephenson, 2000). Rain, frost, flood, and fog are examples of dichotomous meteorological events.
Verification of categorical binary events usually starts with the construction of a contingency table, which represents the frequency of "yes" and "no" model predictions and observed occurrences. In weather forecasting, thresholds are usually defined to categorically determine the occurrence of weather events from continuous variables, depending on whether they fall above or below these thresholds (Table 1). Several popular indices can be derived from contingency tables, such as the probability of detection (POD) and the false alarm rate. These indices measure different aspects of the quality of dichotomous predictive models and are popular evaluation metrics in meteorology studies.
Gradient descent is a versatile and popular technique in machine learning and currently constitutes the de facto methodology to train artificial NN models. Gradient descent prescribes an iterative process that computes the derivative of the loss function (model error) and updates the model parameters following the direction that minimizes this loss until a local or global minimum is reached.
Weather models trained using gradient descent can be evaluated using binary indices, but these indices cannot be naturally integrated in the optimization process, as they are not differentiable. Gradient descent requires smooth, differentiable loss functions for determining its minima points. Categorical binary indices are built using logical comparison operators (<, >), which define a function containing a discontinuity at the threshold point and therefore are nondifferentiable.
The problem of optimizing nondifferentiable categorical classifiers has been explored before in the context of machine learning (Herschtal & Raskutti, 2004;Yan et al., 2003). However, in weather forecasting, precipitation generated by Numerical Weather Prediction (NWP) is usually a quantitative variable and is verified using a variety of quantitative and categorical verification metrics. We propose a methodology for combining both types of metrics and optimizing models that perform well using these metrics. This problem can be formulated as a Pareto or multiobjective optimization problem in which no single solution exists that simultaneously optimizes each objective individually.
In this work we present an alternative formulation of binary indices, which presents the desired characteristics of being both continuous and differentiable. We show how these indices can be integrated into the loss function of weather models trained with gradient descent, allowing the models to optimize them directly. In the experimental section we apply this methodology to train a deep learning network used to predict gridded total precipitation using NWP geopotential heights as input. We demonstrate how the proposed indices are used to optimize the skill of NN models based on different categorical binary metrics. To our knowledge, this is the first proposal of a methodology for optimizing weather precipitation models using a combination of categorical and quantitative indices.
This manuscript is structured as follows: Section 2 briefly covers the derivation of classical categorical binary indices and presents the theoretical basis of the equivalent differentiable indices. Section 3 presents the data, model and experiments for testing the behavior of the proposed indices. Section 4 presents the results of the experiments demonstrating how artificial NN models can be optimized using categorical binary metrics. We finish with Section 5, which provides conclusions and ideas on how the proposed methodology can be further developed in future works.

Categorical Binary Verification Metrics
Performance of binary forecasts can be measured as a function of hits, misses, false alarms, and true negatives, which relate observed and forecasted events. The four combinations of forecasts [yes, no] and observations [yes, no], called the joint distribution, can be represented using a contingency table (see Table 1).
Contingency tables are a useful way to represent the skill and errors of deterministic models: a perfect forecast contingency table contains only hits and true negatives and no misses or false alarms. A variety of popular categorical statistics can be computed from the entries of this contingency table to describe different aspects of the skill of a model. The probability of detection POD = hits / (hits + misses), also known as hit rate, and the probability of false detection POFD = false alarms / (false alarms + true negatives) are examples of these statistics. We refer readers to "Forecast verification: A practitioner's guide in atmospheric science" (Jolliffe & Stephenson, 2012) for a detailed and rigorous coverage of categorical binary indices and weather forecast verification in a broader context.
Categorical binary metrics constitute also a popular choice to determine the skill of quantitative models, such as NWP (Accadia et al., 2003;McBride & Ebert, 2000). Quantitative NWP models generate a continuous range of output values, such as precipitation, wind, and temperature. Contingency tables can be computed by setting a threshold value for an event and using the [<, >] relational operators to transform forecast continuous values into its binary [yes, no] representations or [rain, dry] in the case of precipitation.
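As an illustrative sketch (not code from the paper's repository; the function and variable names here are our own), the contingency table entries and the POD and POFD statistics described above can be computed from continuous fields as follows:

```python
import numpy as np

def contingency_table(obs, pred, threshold):
    """Count hits, misses, false alarms, and true negatives."""
    obs_yes = obs > threshold
    pred_yes = pred > threshold
    hits = int(np.sum(obs_yes & pred_yes))
    misses = int(np.sum(obs_yes & ~pred_yes))
    false_alarms = int(np.sum(~obs_yes & pred_yes))
    true_negatives = int(np.sum(~obs_yes & ~pred_yes))
    return hits, misses, false_alarms, true_negatives

# Toy precipitation values and a 1.0 threshold separating [rain, dry]
obs = np.array([0.0, 2.5, 1.2, 0.3, 4.0, 0.1])
pred = np.array([0.2, 3.1, 0.4, 1.5, 2.2, 0.0])
h, m, fa, tn = contingency_table(obs, pred, threshold=1.0)

pod = h / (h + m)      # probability of detection (hit rate)
pofd = fa / (fa + tn)  # probability of false detection
```

The boolean comparisons in this snippet are exactly the step functions discussed next, which is where differentiability breaks down.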
Logical relational operators define a step function, which is normally represented using 0 to denote the boolean "no" or "false" and 1 for "yes" or "true" values. The transition between 0 and 1 happens at the threshold value, creating a discontinuity or singularity at this point. These functions are discontinuous and therefore nondifferentiable.

Differentiable Categorical Binary Metrics
We propose an alternative formulation of these categorical verification indices using smooth and differentiable functions. Specifically, we use the sigmoid function to represent a smooth transition between the boolean values at the threshold point. The following formula defines the sigmoid function:

sigmoid(x) = 1 / (1 + e^(−k·x))

Figure 1 represents the sigmoid function as a differentiable alternative to the "<" and ">" step functions. The parameter k defines the slope of the sigmoid function, where larger values of k correspond to a steeper transition in the output. When considering a threshold value t, the sigmoid variable x gets translated by this amount, resulting, in the case of the ">" operator, in:

sigmoid_t(x) = 1 / (1 + e^(−k·(x − t)))

These sigmoid functions can be used to approximate the step function and compute a differentiable version of the contingency table previously presented. Each entry in the contingency table is calculated using an element-wise product of the vectors containing the observations and predictions compared with the threshold value t. For example, the following expression calculates the hits:

hits = Σ_i (obs_i > t) · (pred_i > t)

The previous expression can be made differentiable by substituting the comparison in the "predicted" term with a sigmoid function:

hits_diff = Σ_i (obs_i > t) · sigmoid_t(pred_i)

This new expression provides a gradient, allowing model outputs to be optimized around the threshold. Differentiable categorical statistics such as POD or POFD can be formulated using sigmoid functions to replace the comparison operators. For example, the differentiable versions of POD and POFD can be defined as follows:

POD_diff = [Σ_i (obs_i > t) · sigmoid_t(pred_i)] / Σ_i (obs_i > t)

POFD_diff = [Σ_i (obs_i ≤ t) · sigmoid_t(pred_i)] / Σ_i (obs_i ≤ t)

Other differentiable categorical indices can be derived using the new entries in the differentiable contingency table. We refer readers interested in the derivation of these indices to the interactive notebook included in the accompanying code repository (see end of Section 4).
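A minimal NumPy sketch of these differentiable indices (names such as `sigmoid_t`, `pod_diff`, and `pofd_diff`, and the slope parameter `k`, are our own; the accompanying repository may use different identifiers):

```python
import numpy as np

def sigmoid_t(x, t, k=1.0):
    """Smooth, differentiable approximation of the step function (x > t)."""
    return 1.0 / (1.0 + np.exp(-k * (x - t)))

def pod_diff(obs, pred, t, k=1.0):
    """Differentiable probability of detection, hits / (hits + misses)."""
    obs_yes = (obs > t).astype(float)  # observations stay hard labels
    return float(np.sum(obs_yes * sigmoid_t(pred, t, k)) / np.sum(obs_yes))

def pofd_diff(obs, pred, t, k=1.0):
    """Differentiable probability of false detection."""
    obs_no = (obs <= t).astype(float)
    return float(np.sum(obs_no * sigmoid_t(pred, t, k)) / np.sum(obs_no))

obs = np.array([0.0, 2.5, 1.2, 0.3, 4.0, 0.1])
pred = np.array([0.2, 3.1, 0.4, 1.5, 2.2, 0.0])
```

With a steep slope (large `k`), the differentiable indices converge to the hard POD and POFD values computed from the contingency table.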

Data and Experiments
This section presents the data set, model and experiments used to assess the proposed differentiable categorical binary indices. We choose an NN model that is trained to derive the total precipitation field using geopotential height as input.
The NN model is trained to learn the relationship between geopotential values and total precipitation grids. Precipitation is represented using continuous values for each grid cell, and during training the model learns to minimize the error between the predicted grid and ERA-Interim's total precipitation. The error of continuous precipitation fields is typically quantified using the root-mean-square error metric (Stanski et al., 1989; Wardah et al., 2011). In machine learning, the use of mean squared error (MSE) is preferred to root-mean-square error, as it is computationally simpler but equivalent in terms of its local and global minima.

Data Set
The ERA-Interim global climate reanalysis data set produced by the European Centre for Medium-Range Weather Forecasts is used to run the experiments. ERA-Interim contains reanalysis data from 1979 to present with a 6-hr temporal resolution. The spatial resolution of the data set is approximately 80 km (reduced Gaussian grid N128) on 60 vertical levels. ERA-Interim data are publicly accessible at the European Centre for Medium-Range Weather Forecasts's Public Datasets web interface.
For our experiments, we choose the geopotential height (z) and total precipitation (tp) variables. We consider a subset of the original data covering a midlatitude rectangular region (shown in Figure 2, bottom left). Geopotential height at the pressure levels [1,000, 900, 800, 700, 600, 500, 400, 300, 200, 100] hPa is used as input to the model, and total precipitation constitutes the output or predicted field. The resulting geopotential height data are represented as a four-dimensional numerical array with shape [58,440, 80, 120, 10], corresponding to the [time, latitude, longitude, height] dimensions. Similarly, the total precipitation is represented by a three-dimensional numerical array with shape [58,440, 80, 120], representing the [time, latitude, longitude] dimensions. For clarification, the ERA-Interim total precipitation parameter is originally represented using 3-hr period accumulations, which we further aggregate into 6-hr periods to match the 6-hr frequency of the geopotential height field. Figure 2 represents the geographic area (bottom left) and the correspondence between the geopotential height and the total precipitation field time series (right).
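The 3-hr to 6-hr aggregation step described above can be sketched as follows (a hypothetical NumPy example; the actual preprocessing code may differ):

```python
import numpy as np

def to_six_hourly(tp_3h):
    """Sum consecutive pairs of 3-hr accumulations along the time axis.

    tp_3h: array of shape [2 * n, lat, lon] of 3-hr accumulations.
    Returns an array of shape [n, lat, lon] of 6-hr accumulations.
    """
    n = tp_3h.shape[0] // 2
    return tp_3h[: 2 * n].reshape(n, 2, *tp_3h.shape[1:]).sum(axis=1)

tp_3h = np.arange(8.0).reshape(4, 1, 2)  # toy series: 4 steps on a 1x2 grid
tp_6h = to_six_hourly(tp_3h)             # shape (2, 1, 2)
```

Because precipitation is an accumulated field, summing consecutive periods (rather than averaging, as one would for instantaneous fields) preserves the total amounts.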

NN Model
Convolutional encoder-decoder networks are a type of NN that are able to map between multidimensional inputs and outputs by learning a compressed representation of the data. These networks have been used in many different domains to perform classification, segmentation, and regression tasks (Krizhevsky et al., 2012;Long et al., 2015). In the field of meteorology similar networks have been used to model extreme weather events (Liu et al., 2016) and general circulation of the atmosphere (Scher, 2018).
The proposed model is a specific type of convolutional encoder-decoder network called U-net (Ronneberger et al., 2015). We refer readers to our previous work (Larraondo et al., 2019) for a detailed comparison between different encoder-decoder architectures for the case of deriving precipitation from geopotential fields. Figure 3 shows the architecture of the U-net network, representing the changes it performs to the dimensionality of the data. This network is composed of two symmetric parts, which perform a compression of the input data (encoder) and a subsequent decompression that recreates the output space (decoder). The chained convolution operations are able to capture the spatial relationships in the data at different scales and extract the relevant features that relate the input and output spaces. In our case, these are the geopotential heights (at 10 atmospheric levels) and total precipitation. Numbers at the top of this figure represent the dimensions of the images at each stage of the convolutional neural network model. Similarly, numbers at the bottom represent the channels or features at each layer of the network. The input to the network is the 10 geopotential levels, and its output is one image representing the total precipitation field.

Experiment Design
The objective of the experiments presented in this section is to demonstrate how the proposed differentiable categorical indices can be used to optimize the performance of NN models. We define an objective function using a combination of these indices and use it to train a U-net model that predicts ERA-Interim total precipitation with geopotential levels as input. In particular, we choose POD and POFD as the indices to optimize. These indices measure different aspects of the performance of a model, and a combination of them is used to generate relative operating characteristic (ROC) curves (Fawcett, 2006), a popular graphical method often used to represent the overall skill of categorical weather models.
POFD measures the fraction of the observed "no" events incorrectly forecast as "yes." This index ranges from 0 to 1, with 0 being the score of a perfect model. POD measures the fraction of the observed "yes" events correctly forecast. POD also ranges from 0 to 1 but, unlike POFD, its optimal value is the maximum, 1. Improving model performance using these two indices therefore requires maximizing POD while minimizing POFD. POD cannot be directly used in gradient descent optimization, as minimizing the loss would minimize this index, resulting in a model with no skill at all. For the optimization to work in the right direction, POD needs to be inverted, so that minimization corresponds to an increase in the skill of the model. For this purpose, we use the false negative rate (FNR), the complementary index to POD, formalized through the following equation:

FNR = misses / (hits + misses) = 1 − POD

The loss function used in the experiments uses the differentiable versions of FNR and POFD, defined using the equations introduced in Section 2.2. The sigmoid functions used to compute the differentiable indices in the experiments use a fixed value of k = 1, which works well when using the original unscaled values of precipitation. Section 5 discusses the effect of this parameter on the results of the experiments.
Optimizing a model that combines these two metrics, FNR and POFD, results in finding a balance between two opposing forces. Reducing FNR generates overconfident models, which predict precipitation everywhere, whereas minimizing POFD generates underpredicting models with a complete absence of precipitation. Our objective is to use these indices to enhance the output of quantitative models trained to minimize the MSE. In the case of precipitation prediction, MSE is commonly used for verification in the literature (Jolliffe & Stephenson, 2012; Murphy, 1988). We propose the following loss function combining MSE with the differentiable versions of FNR and POFD, following the method defined in Section 2.2:

Loss = MSE + α · FNR_diff + β · POFD_diff

In this equation, α and β are constant parameters that control the relative weight of each categorical index in the overall loss function. MSE acts as a regularization term allowing the model to output continuous precipitation values in the range of the original ERA-Interim total precipitation variable. Without the MSE term, the network would learn to differentiate categorical precipitation around the defined threshold but would not account for quantitative differences in the range of precipitation values.
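A minimal NumPy sketch of this combined loss, assuming the weighting constants are named `alpha` and `beta` (a real training setup would implement this with the tensor operations of the deep learning framework so gradients can flow through it):

```python
import numpy as np

def sigmoid_t(x, t, k=1.0):
    """Smooth approximation of the step function (x > t)."""
    return 1.0 / (1.0 + np.exp(-k * (x - t)))

def combined_loss(obs, pred, t=1.0, k=1.0, alpha=0.0, beta=0.0):
    """MSE regularized with the differentiable FNR and POFD terms."""
    mse = np.mean((pred - obs) ** 2)
    obs_yes = (obs > t).astype(float)
    obs_no = 1.0 - obs_yes
    # Differentiable FNR = 1 - differentiable POD
    fnr = np.sum(obs_yes * (1.0 - sigmoid_t(pred, t, k))) / np.sum(obs_yes)
    pofd = np.sum(obs_no * sigmoid_t(pred, t, k)) / np.sum(obs_no)
    return float(mse + alpha * fnr + beta * pofd)

obs = np.array([0.0, 2.5, 1.2, 0.3, 4.0, 0.1])
```

Setting `alpha = beta = 0` recovers the plain MSE objective used for the baseline model in the experiments.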
In the next section we compare the output of the U-net model trained using different values of α and β in the loss function. To carry out the experiments we apply a 70/30 split over the temporal dimension of the ERA-Interim data set for training and testing the results (the training split contains the years from 1979 to mid-2005 and the validation split from mid-2005 to 2018). The same splits are consistently applied to train each model, so results can be fairly compared using MSE, POD, and POFD as measures of performance.
The baseline performance is set by training the U-net network with MSE exclusively, which corresponds to setting both the α and β constants to 0. Iterating through different combinations of the values of α and β, we compare the resulting models' performance relative to the baseline to understand the influence of these categorical indices.

Results
We start this section by setting up a baseline for the model comparison using just MSE in the loss function, which corresponds to setting both α = 0 and β = 0 in the loss function presented in the previous section. We train the U-net model using the predefined ERA-Interim splits for 100 epochs (iterations over the whole training data set).
During training, we assess the model performance on the validation split at the end of each epoch. The skill of the model is measured using MSE, FNR, and POFD, considering t = 1.0 mm/hr as the threshold value that discriminates precipitation. As a clarification for the reader, the standard POD and POFD indices, and not their differentiable versions, are used for verification. The differentiable indices are used only in the loss function, which is differentiated during the backpropagation phase of the NN model training. Figure 4 shows the evolution of the validation MSE, FNR, and POFD scores across the 100 epochs of training of the U-net model. The first iterations result in a rapid reduction in these indices, which slows down and stabilizes toward the end of the training period.
Using this model as the reference, we now train similar models using different combinations of the α and β constants in the loss function. We consider three scenarios: in the first two, we fix one of the constants to 0 and traverse the values {2, 4, 8} with the other one; the third scenario traverses the same set of values for both constants. In total, we end up with nine models trained with the resulting combinations of α and β. Figure 5 shows the evolution during training of two of these models, in the same way as the previous figure. On the right, for the model trained with α = 2 and β = 0, we see how the FNR score is significantly lower than in the baseline model, at the expense of an increase in MSE and POFD. Similarly, on the left, the model trained with α = 0 and β = 2 shows a significant decrease in the POFD score, which penalizes MSE and POD. Table 2 contains the score values of each model, with the baseline model in the first row. FNR has been converted to its complementary POD, which is a more familiar score in weather forecasting. This table shows the relationships between the values of both constants in the loss function and the resulting variation of the scores. Figure 6 provides a visual understanding of the effect of these constants on the precipitation field learned by the models. Taking one sample from the validation split, corresponding to 16 September 2016, we sequentially represent the original ERA-Interim total precipitation field (not used during training), the output of the baseline U-net model and, in the second row, the three models corresponding to the extremes of each of the scenarios considered. Figure 6c shows an extremely conservative model, which only predicts rain in the regions where there is a strong signal. At the other extreme, in Figure 6d, POD has a large weight in the loss function (through the FNR term) and the model becomes overconfident, representing precipitation values greater than the threshold t = 1.
Looking at the values in Table 2, the third model, corresponding to α = 8 and β = 8, achieves a significantly better POD score with a slight sacrifice in POFD and a significant degradation in MSE. Visually, this model tends to generate precipitation with crisper edges than the reference, but it also seems to overestimate precipitation values on average, which probably explains the increase in MSE.
The proposed objective function establishes a three-way trade-off between the scores. The sensitivity of these scores is, however, nonsymmetric. For example, Figure 7 represents the evolution of the three scores for the two scenarios where one of the constants is set to 0. Another interesting observation about the results is that POFD in the baseline model is already low. The relative improvement achieved by weighting POFD with large values might not compensate for the penalization of the rest of the scores. This fact becomes apparent when assessing performance with compound indices, such as ROC (Fawcett, 2006).
ROC is a popular metric often used to assess the skill of categorical weather forecasts (Kharin & Zwiers, 2003; Mason, 1982), which combines POD and POFD scores measured at different thresholds in a single plot. The area under the curve (AUC) (Marzban, 2004) measures the skill of a ROC plot in the range [0, 1], with 1 being the score of the perfect model. Figure 8 represents the ROC plots and corresponding AUC scores for the baseline and the α = 2, β = 0 model, using the threshold values t = {0.5, 1, 2, 5, 10}. We can see how increasing α and β changes the shape of the ROC plot by bringing the points closer to the top and left sides of the figure, respectively. However, the nature of the optimization problem makes it impossible to achieve the perfect AUC score.
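The ROC/AUC evaluation described here can be sketched as follows (illustrative code with our own names; POD and POFD are computed at the quoted thresholds, the curve is anchored at (0, 0) and (1, 1), and the area is approximated with the trapezoidal rule):

```python
import numpy as np

def pod_pofd(obs, pred, t):
    """Hard POD and POFD at a single threshold t."""
    obs_yes, pred_yes = obs > t, pred > t
    pod = np.sum(obs_yes & pred_yes) / max(int(np.sum(obs_yes)), 1)
    pofd = np.sum(~obs_yes & pred_yes) / max(int(np.sum(~obs_yes)), 1)
    return float(pod), float(pofd)

def roc_auc(obs, pred, thresholds=(0.5, 1, 2, 5, 10)):
    """Approximate the area under the ROC curve (AUC)."""
    pts = [pod_pofd(obs, pred, t) for t in thresholds]
    pods = np.array([0.0] + [p for p, _ in pts] + [1.0])
    pofds = np.array([0.0] + [f for _, f in pts] + [1.0])
    order = np.argsort(pofds, kind="stable")  # sort points by POFD
    x, y = pofds[order], pods[order]
    # Trapezoidal rule over the sorted (POFD, POD) points
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))
```

A perfect forecast places every point at (POFD, POD) = (0, 1), giving an AUC of 1.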
Increasing β penalizes POD so strongly that the resulting AUC scores are worse than those of the baseline model. For our models, we find the best combination at [α = 2, β = 0], which results in an AUC score of 0.982 compared to 0.977 for the baseline model. Increasing the value of α further penalizes the POFD index, resulting in lower AUC scores. Using the methodology presented in this work, it would be possible to design new loss functions that optimize NN models based on the AUC score. Figure 9 represents the results in Table 2 as points in a three-dimensional scatter plot. For interpretability, points have been projected onto each of the three orthogonal planes defined at the origin of coordinates. Points in the vertical planes, containing the MSE axis, define a Pareto front, represented by a blue line in the figure. Pareto efficiency studies the relationship between the variables in a multiobjective optimization problem. Optimality in multiobjective problems happens when an improvement in any one individual criterion makes at least one other criterion worse off. Pareto fronts are a graphical representation of the optimal points, as lines or surfaces, using multidimensional plots. In our case, these Pareto fronts represent the relationships and trade-offs that POD and POFD present in relation to MSE. Each front defines the dependency between both variables, showing that it is impossible to optimize both simultaneously. The horizontal plane, defined by the POD and POFD axes, shows a nearly linear relationship between both variables, and no Pareto relationship is observed.
The models used in this section are implemented using Keras (Chollet et al., 2018), a high-level NN interface written in Python, with TensorFlow (Abadi et al., 2016) as the back end. The code and data to reproduce the experiments presented in this manuscript are available at this repository (https://github.com/prl900/weather_encoders). This repository contains a module with the implementation of differentiable versions of some of the most popular categorical indices used in meteorology, which can be used independently by external models.

Conclusions and Future Work
This manuscript introduces differentiable versions of categorical binary indices that allow training NNs to optimize categorical indices. These indices use the sigmoid function to approximate the logical comparison operators (<, >), making them continuous and therefore differentiable. Building upon these sigmoid functions, we define differentiable versions of well-known categorical indices, such as POD and POFD. We demonstrate how these differentiable indices can be used to train models using gradient descent methods, optimizing their loss functions based on them.
In our experiments we use a specific deep learning NN architecture called U-net encoder-decoder to learn the spatial relationships between NWP variables, ERA-Interim geopotential height and total precipitation. The baseline model is optimized to minimize MSE, and we propose a new objective function that combines MSE with the POD and POFD indices. The experiment results demonstrate how the skill of the model can be optimized toward a specific index by weighting individual scores in this objective function. Section 2.2 introduces the parameter k in the definition of the sigmoid function. This parameter controls the steepness of the sigmoid function, where larger values of k result in steeper sigmoids and therefore better approximations to the step function. Although we expected k to be related to the scale of the precipitation values and to influence the optimization results, we did not find significant differences in the results for values of k larger or smaller than 1. At this point, we do not fully understand why the combined loss function seems to be almost invariant to changes in the shape of the sigmoid (or the scale of the precipitation values). We are currently exploring this relationship, experimenting with new scale-invariant loss functions, and we hope to answer these questions in future work.
Currently, the values of the constants that weight the different scores in the proposed objective function have to be determined relative to the model and data. Although categorical variables represent probabilities bounded between [0,1], the regression term (i.e., MSE and mean absolute error) does not usually have an upper bound. We are planning to carry out further research to come up with new objective functions containing normalized constants which are generic and invariant under scaling of the input data.
Another interesting avenue for research is to explore the definition of high-level objective functions for optimizing models using a combination of scores. Weather forecasting verification is normally performed using a defined suite of tests and scores. Being able to design objective functions according to the verification suites would result in better performing models.