Volume 13, Issue 12 e2021MS002575
Research Article
Open Access

Controlled Abstention Neural Networks for Identifying Skillful Predictions for Regression Problems

Elizabeth A. Barnes
Department of Atmospheric Science, Colorado State University, Fort Collins, CO, USA

Correspondence to: E. A. Barnes, [email protected]

Randal J. Barnes
Civil, Environmental, and Geo-Engineering, University of Minnesota, Minneapolis, MN, USA
First published: 04 December 2021

This article is a companion to Barnes and Barnes (2021), https://doi.org/10.1029/2021MS002573.

Abstract

The earth system is exceedingly complex and often chaotic in nature, making prediction incredibly challenging: we cannot expect to make perfect predictions all of the time. Instead, we look for specific states of the system that lead to more predictable behavior than others, often termed “forecasts of opportunity.” When these opportunities are not present, scientists need prediction systems that are capable of saying “I don't know.” We introduce a novel loss function, termed “abstention loss,” that allows neural networks to identify forecasts of opportunity for regression problems. The abstention loss works by incorporating uncertainty in the network's prediction to identify the more confident samples and abstain (say “I don't know”) on the less confident samples. The abstention loss is designed either to determine the optimal abstention fraction on its own, or to abstain on a user-defined fraction using a standard adaptive controller. Unlike many methods for attaching uncertainty to neural network predictions post-training, the abstention loss is applied during training to preferentially learn from the more confident samples. The abstention loss is built upon nonlinear heteroscedastic regression, a standard computer science method. While nonlinear heteroscedastic regression is a simple yet powerful tool for incorporating uncertainty in regression problems, we demonstrate that the abstention loss outperforms it for the synthetic climate use cases explored here. The implementation of the proposed abstention loss is straightforward in most network architectures designed for regression, as it only requires modification of the output layer and loss function.

Key Points

  • A simple neural network approach for adding uncertainty to climate regression problems is explored

  • A new abstention loss is introduced to identify, and preferentially learn from, more confident samples

  • The abstention loss outperforms other regression loss approaches for multiple climate use cases

Plain Language Summary

The earth system is exceedingly complex and often chaotic in nature, making prediction incredibly challenging: we cannot expect to make perfect predictions all of the time. Instead, we can look for specific states of the system that lead to more predictable behavior than others, often termed “forecasts of opportunity.” When these opportunities are not present, scientists need prediction systems that are capable of saying “I don't know.” We present a method for teaching neural networks, a type of machine learning tool, to say “I don't know” for regression problems. By doing so, the neural network focuses less on the predictions it identifies as problematic and focuses more on the predictions where its confidence is high. In the end, this leads to better predictions.

1 Introduction

The earth system is exceedingly complex and often chaotic in nature, making prediction incredibly challenging: we cannot expect to make perfect predictions all of the time. Instead, we look for specific states of the system that lead to more predictable behavior than others, often termed “forecasts of opportunity” (Albers & Newman, 2019; Barnes, Mayer, et al., 2020; Mariotti et al., 2020; Mayer & Barnes, 2021). When skillful forecast opportunities are not present, scientists need prediction systems that are capable of saying “I don't know.” While this concept of forecasts of opportunity stems from weather and climate predictions, the general idea is far broader than this: a forecast of opportunity framework may be beneficial when certain predictors are only helpful under certain circumstances. Additionally, if the predictor data has unknown errors or corrupted values (e.g., corrupted pixels in satellite imagery), a system that can say “I don't know” can act as an effective data cleaner: identifying the more skillful predictions, when they occur.

Many approaches to identify skillful forecasts of opportunity already exist. For example, retrospective analysis of the forecast can provide a sense of the physical circumstances that can lead to forecast successes or busts (e.g., Rodwell et al., 2013). The ensemble spread can also give a sense of uncertainty in numerical weather prediction systems (e.g., Van Schaeybroeck & Vannitsem, 2016). Albers and Newman (2019) used a linear inverse modeling approach to identify confident subseasonal predictions and showed that these more confident predictions were indeed more skillful. Recently, Mayer and Barnes (2021) and Barnes, Mayer, et al. (2020) suggested that machine learning, specifically neural networks, may be a useful tool to identify forecasts of opportunity for subseasonal-to-seasonal climate predictions. Specifically, a classification network is first trained, then the predicted probabilities are ordered from largest to smallest. A selection of predictions with the highest probabilities are identified as possible forecasts of opportunity. While Mayer and Barnes (2021) and Barnes, Mayer, et al. (2020) show that this approach works well for classification tasks (i.e., predicting a specific category) where the network is already tasked with predicting a probability, it is less clear how one might apply this methodology to regression tasks (i.e., predicting a continuous quantity).

Many of the current machine learning approaches used to identify forecasts of opportunity, including those described above, are applied post-training. The network is first trained, and then the model confidence is assessed. Instead, here we build on the work by Thulasidasan et al. (2019) and Thulasidasan (2020) to develop a deep learning abstention loss function for regression tasks that teaches the network to say “I don't know” (abstain) on certain samples during training. The resulting controlled abstention network (CAN) preferentially learns from the samples in which it has more confidence and abstains on the samples in which it has less confidence. The CAN is designed either to identify the optimal abstention fraction on its own, or to abstain on a user-defined fraction via a standard adaptive controller; both approaches ultimately lead to more accurate predictions than our baseline approach.

Alternative methods have recently been suggested for abstention during training (e.g., Geifman & El-Yaniv, 2019a, 2019b; Sarker et al., 2020; Yuan et al., 2020), including a companion paper to this one (i.e., Barnes & Barnes, 2021a). Unlike these alternative methods, which are designed specifically for classification problems, the CAN approach presented here can be easily implemented in nearly any network architecture designed for regression, as it only requires modification of the output layer and loss function.

Probabilistic deep learning has emerged as a key component in the application of machine learning in the geosciences. These efforts include importing general computer/data science tools into the geosciences (e.g., generative adversarial networks), as well as the development of new tools for specific geoscience use-cases (e.g., Crommelin & Edeling, 2021; Dorling et al., 2003; Foster et al., 2021; Gagne et al., 2020; Guillaumin & Zanna, 2021; Leinonen et al., 2020; Scher & Messori, 2021). In this paper, we do both. We demonstrate the behavior of the CAN on a simple 1D example, and then on synthetic climate data where the correct answer is known. We present two use cases with the climate data. The first use case explores the utility of the CAN to identify climate forecasts of opportunity and is modeled loosely after global teleconnections associated with the El Niño Southern Oscillation (ENSO; e.g., McPhaden et al., 2006; Yeh et al., 2018). The second use case explores the utility of the CAN to act as a data-cleaner by identifying input samples with corrupted pixels and preferentially learning on the uncorrupted samples.

Section 2 introduces the synthetic climate data and general neural network architecture. Section 3 discusses the baseline loss function and the CAN in detail, and Section 4 presents the results. Additional discussion on the approach is provided in Section 5 and conclusions in Section 6.

2 Data and Neural Network Overview

2.1 Synthetic Climate Data

To demonstrate the utility of the CAN, we use the synthetic benchmark data set introduced by Mamalakis et al. (2021). While Mamalakis et al. (2021) provides an extensive description of this data, we give a brief overview here. The data set consists of input fields xi and output series yi (where i denotes the ith sample), which is a function of the input. The input fields represent monthly anomalous global sea surface temperatures (SSTs) generated from a multivariate normal distribution with a correlation matrix estimated from observed SST fields (https://psl.noaa.gov/data/gridded/data.cobe2.html). The ith input sample consists of one map of SST anomalies, denoted as xi. Mamalakis et al. (2021) then defines the global response yi to sample xi as the sum of local, nonlinear responses. Specifically,
yi = Σg Fg(xi,g)    (1)
where g represents the grid point and Fg is defined locally (at each grid point g) by a piecewise linear function. The slopes βn (where n is an integer that runs from 1 to the number of piecewise linear segments, set here to 5) of each local function are chosen randomly from a multivariate normal distribution with correlation matrix, once again, estimated from observed SST fields. Thus, the data set of Mamalakis et al. (2021) consists of input maps of SSTs with spatial correlations indicative of observed SSTs, but where each input map is independent of the others. yi then represents the sum of contributions from each grid point across the globe, where that contribution is a nonlinear function (specifically, a piecewise linear function) of the SST value at that grid point.

We make modifications to the data set of Mamalakis et al. (2021) for the experiments explored here. First, to speed up training time, we reduce the number of grid points (pixels) from that used by Mamalakis et al. (2021) to 60 longitudes and 15 latitudes, for a total of 900 grid points per input map. An example input map is shown in Figure 1, with its corresponding y given in the title. Second, as discussed in detail in Section 4.2, we intentionally shuffle the labels of specific samples to loosely reflect forecasts of opportunity related to climate teleconnections.

Figure 1. General controlled abstention network architecture used for the experiments. A map of synthetic sea-surface temperature anomalies is fed into a fully connected network tasked with predicting μ and σ for that sample.

2.2 Network Architecture and Training

For regression problems, it is typical to have a single output unit that provides the prediction by the network. Here, we add uncertainty estimates to our regression network by simply adding an additional output unit. We give these two output units the names μ and σ as shown in Figure 1. μ denotes the predicted value while σ denotes the uncertainty related to that prediction. As we will show, we can take our interpretation even further and say that the network outputs a probabilistic prediction (conditional probability distribution) for the ith sample in the form of a normal distribution with mean μi and standard deviation σi.

We train a fully connected feed-forward network with two hidden layers with 50 and 25 units, respectively. As described above, the output layer consists of two units. The specific form of the loss functions will be detailed in Section 3. We train with a ReLU (rectified linear unit) activation function, learning rate of 0.0005, and batch size of 32. Since the second output unit (denoted by σ in Figure 1) cannot be negative, we constrain it to be positive through the network setup. We train on 8,000 samples, validate on 5,000 samples, and test on 5,000 samples. While we could train on a much larger data set, we have intentionally kept the sample size relatively small to demonstrate the utility of the CAN when the sample size is relatively low, as is the case for many geoscience applications. All quantities and figures are computed from the testing data unless otherwise specified.
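For concreteness, a minimal tf.keras sketch of this two-output architecture is given below. The layer sizes and activation follow the text above, but the use of a softplus activation to keep σ positive is our assumption; the paper states only that σ is constrained to be positive through the network setup.

    import tensorflow as tf

    # Sketch of the two-output regression network in Figure 1
    inputs = tf.keras.Input(shape=(900,))                          # flattened 60 x 15 synthetic SST map
    hidden = tf.keras.layers.Dense(50, activation="relu")(inputs)
    hidden = tf.keras.layers.Dense(25, activation="relu")(hidden)
    mu = tf.keras.layers.Dense(1, name="mu")(hidden)               # predicted value
    sigma = tf.keras.layers.Dense(1, activation="softplus", name="sigma")(hidden)  # predicted uncertainty, > 0
    outputs = tf.keras.layers.Concatenate(name="mu_sigma")([mu, sigma])
    model = tf.keras.Model(inputs, outputs)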

We employ early stopping to automatically determine the optimal number of epochs to train. Specifically, the network stops training when the validation loss stops decreasing, with a patience of 60 epochs, and the weights for the epoch with the best performance on the validation loss are saved. For the CAN specifically, we select the best performing epoch from epochs after the spin-up period (as will be defined in Section 3.2.2), but we only consider epochs where the validation abstention fraction (i.e., the fraction of samples for which the network says “I don't know”) is within 0.1 of the abstention setpoint determined by the user (see Section 3 for more details). For all examples shown here, 20 unique networks are trained for each configuration (i.e., baseline ANN and CAN) by varying the randomly initialized weights.

The network was trained using Python 3.7.9 and TensorFlow 2.4.

3 Abstention Method and Loss

3.1 Baseline Network With Log-Likelihood Loss

The baseline artificial neural network (ANN) has the architecture shown in Figure 1 and trains using the negative log-likelihood loss defined for sample xi as,
Li = −log(pi),    (2)
where pi is the value of the probability density function of a normal distribution, N(μi, σi), with mean μi and standard deviation σi, evaluated at the true value yi:
pi = 1/(σi √(2π)) exp(−(yi − μi)² / (2σi²))    (3)

This baseline model predicts μi and σi for each sample, where μi is the model's best guess of yi and σi is the associated uncertainty. This approach is called nonlinear heteroscedastic regression in the computer science literature (e.g., Duerr et al., 2020, Sections 4.3.3 and 5.3.2), and Barnes, Barnes, and Gordillo (2021) provide a basic overview of its use for geoscience applications. The approach was first proposed by Nix and Weigend (1994, 1995), and extended by Williams (1996).
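A minimal sketch of this negative log-likelihood loss (Equations 2 and 3) is given below, assuming the two network outputs are concatenated as [μ, σ] as in the architecture sketch of Section 2.2; the Adam optimizer is an assumption, since the text specifies only the learning rate.

    import numpy as np
    import tensorflow as tf

    def negative_log_likelihood(y_true, y_pred):
        # Per-sample -log(p_i), where p_i is the normal pdf N(mu_i, sigma_i) evaluated at y_i
        mu = y_pred[:, 0]
        sigma = y_pred[:, 1]
        y = tf.squeeze(y_true)
        return 0.5 * tf.math.log(2.0 * np.pi * sigma**2) + (y - mu) ** 2 / (2.0 * sigma**2)

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
                  loss=negative_log_likelihood)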

Once the network is trained, we can invoke abstention on the less certain predictions by thresholding on σ. For example, the 20% most confident predictions are those with the smallest 20% of σ values (i.e., predicted σ below the 20th percentile). As we will show, this thresholding approach for abstention is itself very powerful and can be used as a simple way to add uncertainty to regression networks. In addition, this baseline approach will serve as a comparison for the CAN.
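As a short illustration, post-training abstention at 20% coverage could be implemented as sketched below, where model and the assumed arrays x_test and y_test follow the earlier sketches.

    import numpy as np

    preds = model.predict(x_test)              # columns: [mu, sigma]
    mu, sigma = preds[:, 0], preds[:, 1]

    tau_20 = np.percentile(sigma, 20)          # 20th percentile of the predicted sigma
    covered = sigma <= tau_20                  # the 20% most confident predictions
    mae_at_20_coverage = np.mean(np.abs(y_test[covered] - mu[covered]))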

As an additional baseline, we will also compare our results with those obtained by training a standard feed-forward network containing a single output unit and a loss function defined by the mean absolute error (MAE). In this case, the network does not quantify uncertainty (i.e., σ). Consequently, only summary statistics over all testing predictions are provided.

Throughout this paper, we use “coverage” to denote the fraction of samples for which the network makes a prediction, and “abstention” to refer to the fraction of samples for which the network does not make a prediction. Thus, the percent coverage is always 100% minus the percent abstention. For the baseline approach, abstention and coverage are computed post-training based on the predicted uncertainties σ, while for the CAN, these quantities are determined during the training itself (see next section).

3.2 Controlled Abstention Network

3.2.1 Abstention Loss

Unlike the baseline ANN, the CAN loss is designed to identify the less confident predictions so as to preferentially learn from the more confident predictions. The CAN loss for sample xi (Equation 4) weights the negative log-likelihood of Equation 2 by a prediction weight qi and adds a term, modulated by α, that penalizes large σ predictions; α controls the amount of abstention (see next subsection). The prediction weight qi (Equation 5) is a decreasing function of σi relative to a data-specific scale κ (see below), and it tells the CAN how much it should consider sample i when it reduces the total loss during backpropagation. Note that Equation 4 is very similar to the abstention loss of Thulasidasan (2020, Chap. 4) and Barnes and Barnes (2021a) for classification networks.

The loss above works by increasing σi values on samples that the CAN identifies as less certain. In this way, one can define abstention based on a threshold on σ. Specifically, the CAN abstains when the predicted σi > τ. To define τ, let σval(m) denote the mth percentile of the predicted validation σ at the end of the spin-up period (as will be defined in Section 3.2.2). Then τ = σval(m), where m is the percent coverage setpoint. For example, for a coverage setpoint of 80% (abstention setpoint of 20%), τ is set to the 80th percentile of predicted validation σ at the end of the spin-up period: τ = σval(80). Note that since τ is defined by the validation data at the end of the spin-up period, it remains fixed during training and evaluation of the testing data.

We define κ = σval(90). This definition of κ is something that the user can modify. For example, setting κ = τ is an obvious choice. However, we found that setting κ = σval(90) outperformed κ = τ and worked for all experimental setups here; consequently, we did not explore further tuning of this parameter.

To summarize this section, the abstention loss looks a lot like the baseline loss (Equation 2). The main difference is the use of an additional scaling factor q and an additional term that penalizes the network for large σ predictions. This penalty is modulated by α. κ and τ are parameters set by the network during the spin-up training period. κ acts as a scaling parameter on σ within the loss function. Samples with σ larger than κ contribute less to the loss function, while samples with σ smaller than κ contribute their full amount. τ, on the other hand, sets the threshold used to define abstention and is used by the adaptive controller (see next section) when the user wishes to set a target coverage fraction.
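Because Equations 4 and 5 are not reproduced in the text here, the sketch below only illustrates the general structure described above, using one plausible choice of prediction weight and penalty (q = κ/(κ + σ) and an α-modulated log(1/q) penalty). These specific functional forms are our assumptions for illustration; they are not necessarily the paper's exact Equations 4 and 5.

    import numpy as np
    import tensorflow as tf

    def make_abstention_loss(alpha, kappa):
        # alpha: penalty weight on large sigma; kappa: data-specific scale from spin-up
        def loss(y_true, y_pred):
            mu = y_pred[:, 0]
            sigma = y_pred[:, 1]
            y = tf.squeeze(y_true)
            nll = 0.5 * tf.math.log(2.0 * np.pi * sigma**2) + (y - mu) ** 2 / (2.0 * sigma**2)
            q = kappa / (kappa + sigma)                      # prediction weight: near 1 for confident samples
            return q * nll + alpha * tf.math.log(1.0 / q)    # penalize the network for inflating sigma
        return loss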

3.2.2 Setting the Abstention Setpoint

The abstention loss, as defined in Equation 4, can be used in two distinct ways, depending on how α is determined. The first way is to set α to a predetermined constant. By doing this, the network is penalized equally throughout training for assigning high σ values. If α is chosen correctly, the network can learn the optimal percent coverage from the data set, where percent coverage is defined as the percent of samples on which the network makes a prediction (i.e., the percent coverage is 100% minus the percent abstained). When α is held constant, the coverage setpoint is not set by the user and so we set τ = κ. Physically, this represents the fact that the definition of abstention is set by the 90th percentile of the predicted validation σ values at the end of spin-up (i.e., τ = κ = σval(90)). This works well because this same value is also used to define κ, the normalization factor used to set the confidence q in Equation 4.

Alternatively, α can be adaptively modified throughout training so that the network abstains on a specified fraction of the training samples. Inspired by the success reported in Thulasidasan (2020, Chap. 4), we implement a discrete-time proportional-integral-derivative (PID) controller (velocity algorithm) to modulate α throughout training (e.g., Visioli, 2006, Equation 1.38). A PID controller is a simple, and widely used, feedback algorithm that changes a control parameter (e.g., α) to achieve a desired setpoint (e.g., abstention fraction). The most familiar example of a PID controller is a car's cruise control.

Thulasidasan (2020) solely explores low abstention setpoints (e.g., 10%), and evaluates the PID terms batch by batch. For our applications, however, we need the algorithm to work well for a broad range of abstention setpoints (e.g., from 10% to 90%). With a high abstention setpoint, say 90%, and a batch size of 32, only 3 samples on average would be covered per batch, which leads to unstable behavior. Because of this, we evaluate the PID terms on six consecutive batches (32 × 6 = 192 samples), which leads to more stable behavior of the abstention fraction while not being so large as to impede training. Figure 5 shows examples of the PID controller modulating α to control the abstention fraction during training.
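A hedged sketch of such a discrete-time PID controller in velocity (incremental) form is shown below. The gains kp, ki, and kd are illustrative placeholders rather than the values used in the paper, and update would be called once every six batches with the abstention fraction observed over those batches.

    class PIDController:
        """Velocity-form PID controller that nudges alpha toward a target abstention fraction."""

        def __init__(self, setpoint, kp=1.0, ki=0.1, kd=0.0):
            self.setpoint = setpoint        # desired abstention fraction
            self.kp, self.ki, self.kd = kp, ki, kd
            self.e_prev = 0.0               # error one step back
            self.e_prev2 = 0.0              # error two steps back

        def update(self, alpha, abstention_fraction):
            e = abstention_fraction - self.setpoint
            # Incremental update: too much abstention (e > 0) increases alpha, which
            # penalizes large sigma more strongly and therefore reduces abstention.
            d_alpha = (self.kp * (e - self.e_prev)
                       + self.ki * e
                       + self.kd * (e - 2.0 * self.e_prev + self.e_prev2))
            self.e_prev2, self.e_prev = self.e_prev, e
            return max(0.0, alpha + d_alpha)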

The training of the CAN occurs in two stages:
  • Spin-up: For the first Nspin epochs, the CAN is trained using the baseline loss function given in Equation 2. At the end of spin-up, σval(m) is computed on the validation samples for m between 10 and 90 in increments of 10 (a short computational sketch follows this list).

  • Abstention training: The CAN continues from where it stopped during the spin-up stage, but now trains using the abstention loss of Equation 4, with κ and τ defined from σval(m). During this stage, α is either updated by the PID controller, or held constant at a user-defined value.
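A short sketch of the end-of-spin-up computation is given below, assuming the model from the earlier sketches and an assumed validation array x_val; σval(m) is simply the mth percentile of the predicted validation σ.

    import numpy as np

    sigma_val = model.predict(x_val)[:, 1]                     # predicted sigma on the validation set
    sigma_val_m = {m: np.percentile(sigma_val, m) for m in range(10, 100, 10)}

    coverage_setpoint = 80                                     # e.g., cover 80% and abstain on 20%
    tau = sigma_val_m[coverage_setpoint]                       # abstain when predicted sigma > tau
    kappa = sigma_val_m[90]                                    # data-specific scale for the prediction weight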

Based on these stages of training, there are only one or two new free parameters to be determined by the user, depending on whether the PID controller is used to update α or whether α is held constant. Specifically, the user must choose the number of spin-up epochs, Nspin, for both methods, and must also choose α if it is held fixed. While other parameters can certainly be tuned, we did not find it necessary for the range of experiments included in this paper.

In the case that the user wishes to fix the abstention setpoint (i.e., allow the PID controller to update α), this choice of setpoint can impact both the overall accuracy of the CAN, as well as the extent to which the CAN is an improvement over the baseline ANN (as we will show). The choice of abstention setpoint, however, is often not very clear. If the user has a strong prior assumption as to the fraction of samples that may be extra difficult to predict, this is a good place to start. In the absence of this a priori knowledge, our suggestion is to train multiple CANs for various setpoints and observe the CAN's behavior for each before settling on one particular value.

4 Results

4.1 A Simple 1D Example

Before we discuss results with the synthetic climate data, it is informative to explore the behavior of the baseline ANN and CAN for a simple example with a 1-dimensional input. Specifically, we define an (x, y) data set (Equation 6; Figure 2a) composed of two populations, with noise ϵ(a, b) drawn from a normal distribution with mean a and standard deviation b: 20% of the samples (xl, yl) lie along a line, and 80% of the samples (xc, yc) lie within a cloud. Figure 2a shows the data with x on the x-axis and y on the y-axis. The data is designed such that for x less than about 2.5, the data largely follows a straight line with little noise. For larger x, the data shows a cloud of points with no clear linear relationship. Naively fitting a straight line through all of this data would result in a fit that performs poorly on most samples. Instead, we would like a network to predict the samples along the line accurately while also identifying the samples within the cloud as highly unpredictable (“I don't know”).
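The sketch below generates a data set with this qualitative structure (20% of samples along a line for small x, 80% in a cloud for larger x). The specific slope, ranges, and noise levels are illustrative assumptions and not the exact values of Equation 6.

    import numpy as np

    rng = np.random.default_rng(seed=0)
    n = 3000
    n_line = int(0.2 * n)                                   # 20% "line" samples

    x_line = rng.uniform(0.0, 2.5, n_line)                  # small x: nearly noiseless line
    y_line = 2.0 * x_line + rng.normal(0.0, 0.1, n_line)

    x_cloud = rng.uniform(2.5, 10.0, n - n_line)            # larger x: unpredictable cloud
    y_cloud = rng.normal(5.0, 3.0, n - n_line)              # no relationship with x

    x = np.concatenate([x_line, x_cloud])
    y = np.concatenate([y_line, y_cloud])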
Figure 2. 1D example with constant α. (a) Data used for the simple 1D example. (b) Predicted y versus the true y for the baseline artificial neural network (ANN) predictions. The dashed line denotes the one-to-one line, a perfect prediction. Panel (c) is the same as panel (b), but for the controlled abstention network predictions. Scatter plots only show covered predictions (i.e., non-abstained). Colors denote the predicted σ, and insets in panels (b) and (c) display histograms of the predicted σ for both covered and abstained predictions. (d) Mean absolute error versus coverage for different neural network loss functions over a range of initialization seeds for constant α = 0.1. Purple shading denotes the full range of errors over 20 baseline ANN models; the solid purple line denotes the median.

The network is trained to take the input value xi and predict yi. For this 1D simple example only, we train a fully connected network with 2 hidden layers of 5 units each. We found that this architecture is complex enough to learn the linear fit but not so complex as to learn a separate fit for the cloud. The network is trained with a constant α to evaluate whether the CAN is able to identify the correct coverage fraction of 20% and abstain on the remaining 80%. We found that α = 0.1 works well. We set the number of spin-up epochs to Nspin = 225 and use a learning rate of 0.0001. Finally, we train on 3,000 samples, validate on 1,000 samples, and test on 1,000 samples.

Figure 3 shows α (fixed to 0.1 after spin-up), the abstention fraction, and the loss as a function of epoch during training for one particular model. The loss of both the training and validation data drops steadily during the spin-up stage of 0–225 epochs. At the start of the abstention stage, α is fixed to 0.1, while the abstention fraction is allowed to vary. However, it is clear that the network identifies an optimal abstention fraction by the second epoch of the abstention stage, which does not vary for the rest of the training. Training is halted by early stopping, and the best weights are taken from epoch 559.

Figure 3. 1D example with constant α. Example training and validation metrics for a constant α = 0.1.

Results from the baseline ANN and the CAN are shown in Figures 2b–2d. As shown in Figures 2b and 2d, the baseline ANN outperforms the MAE model for all coverage percentages. Unlike the MAE model, the baseline ANN learns which samples are more certain and scales its predicted σ accordingly, as shown by the inset histogram in Figure 2b. Figure 4 shows the histograms of the standardized errors zi for the baseline ANN, which are defined as,
zi = (yi − μi) / σi    (7)
Figure 4. 1D example with constant α. Histograms of the standardized errors (z-scores) of predictions by the baseline artificial neural network for all samples. Means and standard deviations of these standardized errors are shown in colored text.

The mean and standard deviation of the zi are approximately 0 and 1 for both training and validation. This reveals that the σ are more than just unscaled measures of relative confidence. Rather, we may usefully interpret μi and σi as the mean and standard deviation of an approximate conditional probability distribution for prediction i.
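This interpretation can be checked directly with a few lines of code, assuming the mu, sigma, and y_test arrays from the thresholding sketch in Section 3.1.

    import numpy as np

    z = (y_test - mu) / sigma          # standardized errors, Equation 7
    print(np.mean(z), np.std(z))       # values near 0 and 1 support the probabilistic interpretation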

Focusing more closely on the baseline ANN results in Figure 2d (solid purple line), the error decreases with the coverage percent. Recall that while the baseline ANN makes a prediction for every sample, we can use the predicted σ to threshold the predictions. Thus, a coverage setpoint of x% for the baseline ANN implies that we have evaluated the baseline error on only the x% smallest σ values. Decreasing error with coverage indicates that the more confident predictions are also more correct. As mentioned in the introduction, this is the idea behind forecasts of opportunity, and the baseline ANN alone is able to identify the most skillful forecasts without abstention. Even so, the CAN (orange dots) outperforms the baseline ANN slightly: its error is slightly below that of even the best baseline ANN model, and it does a slightly better job of learning the best-fit line (Figure 2c). The CAN obtains its edge over the baseline ANN because the design of the abstention loss allows it to put even more energy into learning the relationships of the confident samples. Furthermore, recall that 20% of the data falls along the well-defined line in Figure 2a, and the CAN is able to identify the optimal coverage percent as 19%.

4.2 Forecasts of Opportunity

For our first use case with the synthetic climate data, we modify the data to loosely reflect forecasts of opportunity related to teleconnections associated with ENSO. Warm ENSO events (El Niño events) have long been known to impact global temperatures and precipitation (e.g., McPhaden et al., 2006; Yeh et al., 2018). At times these events have led to skillful forecasts on subseasonal-to-seasonal time scales (e.g., Johnson et al., 2014). To mimic this behavior with our synthetic data set, we average the anomalous SSTs in the ENSO region within the equatorial eastern Pacific (dashed white box in the map in Figure 1). When the average value in this box is larger than 0.5 (29% of the samples), we leave the sample as is. This reflects an opportunity where a strong El Niño may lead to more predictable behavior of the global climate system. Samples where the average value is less than 0.5 represent “noisy” samples; consequently, we shuffle the y values across these samples so that there is no relationship between the input maps x and their labels y. With such a setup, we anticipate that the network can identify strong synthetic El Niño samples (i.e., large values within the ENSO box, Figure 1) as samples with high confidence and low error.
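A hedged sketch of this label shuffling is given below; x_all and y_all are assumed arrays of flattened input maps and labels, the maps are assumed to be stored latitude by longitude, and the ENSO box indices are placeholders rather than the exact coordinates used in the paper.

    import numpy as np

    rng = np.random.default_rng(seed=0)
    maps = x_all.reshape(-1, 15, 60)                        # (samples, lat, lon)
    enso_box = maps[:, 6:9, 40:50]                          # placeholder equatorial eastern Pacific box
    box_mean = enso_box.mean(axis=(1, 2))

    noisy = box_mean <= 0.5                                 # samples without a strong synthetic El Nino
    y_shuffled = y_all.copy()
    y_shuffled[noisy] = rng.permutation(y_shuffled[noisy])  # break the x-y relationship for noisy samples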

We train separate models for abstention setpoints ranging from 0.1 to 0.9 in increments of 0.1. Figure 5 shows α, the abstention fraction, and the loss as a function of epoch during training for two different abstention setpoints. Following the spin-up period of 15 epochs, the PID controller adjusts α to maintain an abstention fraction within 0.1 of the abstention setpoint.

Figure 5. Forecasts of opportunity experiment with proportional-integral-derivative (PID)-controlled α. Example training and validation metrics for abstention setpoints of (a) 0.3 and (b) 0.7.

Results for the baseline ANN and PID-controlled CAN are shown in Figure 6. The top three panels (Figures 6a–6c) display predictions for individual testing samples by a single network, where the color of each dot reflects the predicted σ. The bottom panel (Figure 6d) summarizes the prediction errors across many trained networks, and serves as a comparison between the baseline ANN and the CAN. As shown in Figure 6d, the baseline ANN error (purple shading) decreases for decreasing coverage. This documents the ability of the baseline ANN to identify the forecasts of opportunity while it assigns higher σ values to samples with higher uncertainty (Figure 6a). The colored dots in Figure 6d show results from the PID-controlled CAN for a range of abstention setpoints. Like the baseline ANN, the CAN error decreases with decreasing coverage; however, the best CAN models are always better (lower error) than the best baseline ANN models. This is especially evident for coverage fractions below 30%, which corresponds to the 29% of samples that are forecasts of opportunity (i.e., unshuffled). Figures 6b and 6c display the predictions by the CAN, including histograms of σ, for two coverage fractions. For lower coverage fractions (higher abstention fractions), the CAN pushes the abstained σ values to larger values and likewise improves its confidence on the covered samples by reducing σ (compare the predicted σ histograms inset in Figures 6b and 6c). That is, the CAN with 24% coverage learns the forecasts of opportunity samples better than the baseline ANN, and better than it does for higher coverage fractions.

Figure 6. Forecasts of opportunity experiment with proportional-integral-derivative (PID)-controlled α. (a) Predicted y versus the true y for the baseline artificial neural network (ANN) predictions. The dashed line denotes the one-to-one line, a perfect prediction. Panels (b) and (c) are the same as panel (a), but for controlled abstention network predictions at two different coverage rates. Scatter plots only show covered predictions (i.e., non-abstained). Colors in panels (a–c) denote the predicted σ, and insets in panels (a–c) display histograms of the predicted σ for both covered and abstained predictions. (d) Mean absolute error versus coverage for different neural network loss functions over a range of initialization seeds and abstention setpoints (shown in colors). Purple shading denotes the full range of errors over 20 baseline ANN models; the solid purple line denotes the median.

Figure 7 shows the histograms of the standardized errors zi from the baseline ANN (see Equation 7). As in the 1D example, the mean and standard deviation of the zi are approximately 0 and 1 for both training and testing data (validation data looks similar). This reveals that the σ are more than just unscaled measures of relative confidence. Moreover, we may usefully interpret μi and σi as the mean and standard deviation of an approximate conditional probability distribution for prediction i. One can also create histograms for the CAN of the predicted samples (not shown); however, in this case the histograms are much narrower since the covered (non-abstained) samples tend to be highly confident and exhibit small σ, as expected.

Figure 7. Forecasts of opportunity experiment with proportional-integral-derivative (PID)-controlled α. Histograms of the standardized errors (z-scores) of the predictions by the baseline artificial neural network for all samples. Means and standard deviations of these standardized errors are shown in colored text.

Thus far, we have trained the CAN to identify synthetic El Niño forecasts of opportunity with the PID controller, which maintains a user-chosen abstention setpoint during training. We can instead use the constant α approach to see if the CAN identifies the correct abstention fraction. We set α = 0.1; the results are shown in Figure 8. The CAN outperforms the baseline ANN with constant α, as it did with the PID controller. In addition, it identifies a coverage fraction very close to the 29% of samples that are forecasts of opportunity.

Figure 8. Forecasts of opportunity experiment with constant α. (a) Predicted y versus the true y for the baseline artificial neural network (ANN) predictions. The dashed line denotes the one-to-one line, a perfect prediction. Panel (b) is the same as panel (a), but for controlled abstention network predictions at a coverage rate of 24%. Scatter plots only show covered predictions (i.e., non-abstained). Colors in panels (a) and (b) denote the predicted σ, and insets in panels (a) and (b) display histograms of the predicted σ for both covered and abstained predictions. (c) Mean absolute error versus coverage for different neural network loss functions over a range of initialization seeds for constant α = 0.1. Purple shading denotes the full range of errors over 20 baseline ANN models; the solid purple line denotes the median.

Interestingly, we find that the PID controller method tends to slightly outperform the constant α approach (compare the 25% coverage errors between Figures 6d and 8c). It is unclear to the authors why this is the case; it could be a function of this synthetic data set. Future work will explore this behavior further.

4.3 Corrupt Inputs

For the second synthetic use case, we modify the climate input maps by “corrupting” some of the grid points by setting them equal to −4.0. This exercise is meant to mimic a data set where some of the inputs have bad pixels in some areas. An example of this is shown in Figure 9. We corrupt 30% of the samples and leave the remaining 70% unmodified. We use the CAN with constant α = 0.05 to assess whether the network is able to successfully identify the correct abstention fraction of 30%. Results are shown in Figure 10. Once again, the baseline ANN outperforms the standard MAE model for coverages less than 100%. Furthermore, the CAN outperforms the baseline ANN and correctly identifies 70% coverage (30% abstention) as the optimal fraction.
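A hedged sketch of the corruption procedure is given below; x_all is an assumed array of flattened input maps, and the fraction of corrupted pixels per map is set to 66% to match the example in Figure 9, although the text does not state that every corrupted map uses that fraction.

    import numpy as np

    rng = np.random.default_rng(seed=0)
    x_corrupt = x_all.copy()                                 # (n_samples, 900) flattened maps
    n_samples, n_pixels = x_corrupt.shape

    bad_samples = rng.choice(n_samples, size=int(0.3 * n_samples), replace=False)
    for i in bad_samples:                                    # corrupt 30% of the samples
        bad_pixels = rng.choice(n_pixels, size=int(0.66 * n_pixels), replace=False)
        x_corrupt[i, bad_pixels] = -4.0                      # set corrupted pixels to -4.0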

Figure 9. Corrupt inputs experiment. (a) An unmodified input map and (b) a corrupted input map in which 66% of the pixels have been set to −4.0.

Figure 10. Corrupt inputs experiment with constant α. (a) Predicted y versus the true y for the baseline artificial neural network (ANN) predictions. The dashed line denotes the one-to-one line, a perfect prediction. Panel (b) is the same as panel (a), but for controlled abstention network predictions at a coverage rate of 24%. Scatter plots only show covered predictions (i.e., non-abstained). Colors in panels (a) and (b) denote the predicted σ, and insets in panels (a) and (b) display histograms of the predicted σ for both covered and abstained predictions. (c) Mean absolute error versus coverage for different neural network loss functions over a range of initialization seeds for constant α = 0.05. Purple shading denotes the full range of errors over 20 baseline ANN models; the solid purple line denotes the median.

This use case demonstrates the ability of the CAN to act as a “data cleaner” for regression problems (Thulasidasan et al., 2019); the CAN preferentially learns on the uncorrupted samples and abstains on the corrupted ones. Note that if we had corrupted samples in the training set only (not in the testing set), we could remove these corrupted samples prior to training to obtain a model that performs well on the clean data set. This is different than what we have done here. We have trained the network to not only learn the uncorrupted samples, but to also learn to identify the corrupted samples. This means that in the future, when new, unseen samples are pushed through the network, the network will be able to handle them accordingly whether they are corrupted or not.

5 Discussion

In many ways, abstention loss is yet another approach to combat overfitting, if we think broadly of overfitting as incorrectly learning “noise” within the training samples that is not present in the validation or testing samples. Common approaches for dealing with overfitting include dropout (Srivastava et al., 2014) and regularization. To explore this a bit further, we reran our forecast of opportunity experiment shown in Figure 6 but applied ridge regression with an L2 parameter of 0.1 (Marquardt & Snee, 1975) to the first layer of the network. Ridge regression reduces the magnitude of individual weights, and thus spreads the importance across multiple units (see Barnes, Toms, et al., 2020, Figure 3). Results, shown in Figure S1, can be directly compared with those in Figure 6. Regularization slightly reduces both the baseline ANN and CAN errors, and allows the baseline ANN to perform more similarly to the CAN. Even so, the CAN outperforms the baseline ANN for the lowest coverage fractions, consistent with the fraction of noisy samples within the synthetic data set. Overall, we see that for this specific use case, regularization can be paired with the abstention loss to produce an even better prediction.
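In tf.keras, this sensitivity test amounts to adding an L2 kernel regularizer to the first hidden layer, as sketched below with the 0.1 value from the text.

    import tensorflow as tf

    first_hidden = tf.keras.layers.Dense(
        50, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(0.1))   # ridge (L2) penalty on first-layer weights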

Results presented here were based on the synthetic climate data of Mamalakis et al. (2021), where each sample is independent and the input and output values are largely symmetric about zero. However, real data seldom behave so well. It is likely that real data may require a transformation (e.g., standardization or a power transformation) prior to training if we are to interpret μi and σi as the mean and standard deviation of an approximate conditional probability distribution for prediction i. Furthermore, a potential concern is that we only present use cases based on synthetic climate data. Our aim in this paper is to demonstrate the basic concept and implementation of the abstention loss in a setting where the correct answer is known. This leaves exploration of the CAN's utility in specific scientific contexts to future research. For example, previous work using neural networks to identify climate forecasts of opportunity on subseasonal-to-decadal timescales could be extended by taking an abstention approach (e.g., Barnes, Mayer, et al., 2020; Gordon et al., 2021; Mayer & Barnes, 2021; Tseng et al., 2020).

As presented here, the CAN learns to abstain on hard-to-predict samples by learning to identify particular features within the inputs that indicate a difficult prediction (and thus abstains). One could imagine a very different “abstention network” that instead is tasked to say “I don't know” on out-of-sample inputs. That is, a network that can identify when particular inputs are nothing like anything it has seen during training and then abstain on the prediction. While such a network would certainly be applicable to earth science applications, for example, for predicting phenomena under a future warmer climate (e.g., Rasp et al., 2018), it is not how we have approached abstention here.

While we have shown that the abstention loss outperforms the baseline ANN approach, we wish to stress that this baseline approach (nonlinear heteroscedastic regression) is itself a simple yet powerful method for incorporating uncertainty into neural network regression problems. This is especially true because the output offers approximate conditional probability distributions for the predictions. Although this baseline approach is a standard in the computer science literature (e.g., Duerr et al., 2020, Chapters 4 and 5), and has been previously used in the geosciences (e.g., Dorling et al., 2003; Guillaumin & Zanna, 2021), it is much less known in the geoscience community (see Barnes, Barnes, & Gordillo, 2021 for a basic overview). The authors believe it will be a simple, powerful, and widely used tool as we move forward.

6 Conclusions

The ability to say “I don't know” is an important skill for any scientist.

In the context of prediction with deep learning, the identification of uncertain (unpredictable) samples is often approached post-training. In this paper, we propose an alternative: a deep learning loss function that can abstain during training for regression problems. We first present a baseline regression approach and then introduce a new abstention loss for regression. The CAN, trained with the abstention loss, preferentially learns from the more confident samples and ultimately outperforms the baseline ANN approach.

An additional benefit of both the baseline ANN and the CAN is their simplicity: they are straightforward to implement in nearly any network architecture as they only require modification of the output layer and training loss. The abstention loss framework has the potential to aid deep learning algorithms in identifying skillful forecasts, as well as corrupt samples, ultimately improving performance on the samples with predictability.

Acknowledgments

The authors wish to thank the editor, associate editor, and two anonymous reviewers for helping us improve the paper. This work was funded in part by the NSF AI Institute for Research on Trustworthy AI in Weather, Climate, and Coastal Oceanography (AI2ES) under NSF grant ICER-2019758.

Data Availability Statement

The abstention network code is available at Barnes and Barnes (2021b) and the synthetic climate data set analyzed here can be accessed at Barnes, Barnes, and Mamalakis (2021).