1 Introduction

As deep learning penetrates more and more application areas, there is a natural demand to adapt deep learning techniques to area and task-specific requirements and constraints. An immediate consequence of this is the expectation to perform well with respect to task-specific performance measures. However, this can be challenging, as these performance measures can be quite complex in their structure and be motivated by legacy, rather than algorithmic convenience. Examples include the F-measure that is popular in retrieval tasks, various ranking performance measures such as area-under-the-ROC-curve, and the Kullback–Leibler divergence that is popular in class-ratio estimation problems.

Optimizing these performance measures across application areas has proved to be challenging even when learning linear models, as is evidenced by the recent surge in methods for optimizing “non-decomposable” loss functions for learning linear models, as we review in Sect. 2. The challenge becomes doubly hard when trying to do so while training neural network architectures such as multi-layer perceptrons and convolutional or recurrent neural networks.

The vast majority of contemporary training techniques for neural networks use simple per-sample loss functions such as the least-squares loss or cross-entropy. While their use has allowed research to focus on developing more sophisticated network architectures, as well as on highly optimized implementations of training routines on GPU architectures, we show that this is suboptimal and that a sound effort towards training with task-specific loss functions pays off handsomely.

In particular, we will show how to train neural networks to directly optimize (smoothed versions of) non-decomposable performance measures such as the F-measure, Q-mean and Kullback–Leibler divergence, which are popularly used in binary classification with label imbalance and in quantification (or class-ratio estimation) settings. The problem of quantification requires accurate estimation of the relative prevalence of class labels (e.g., the fraction of positive vs negative reviews for some product or campaign) and is useful in social engineering and epidemiology. We apply our techniques to label-imbalanced learning problems on several benchmark datasets as well as a sentiment analysis challenge task on Twitter feeds. We find that, in general, our proposed methods offer superior performance while utilizing fewer samples and less training time.

2 Related work

Recent years have seen much interest, as well as progress, in training directly with task-specific performance measures in the fields of classification and ranking.

2.1 Works on non-decomposable measures with linear models

  1. Koyejo et al. (2014) and Narasimhan et al. (2014) use plug-in classifiers for optimizing non-decomposable performance measures for binary classification such as F-measure and Q-mean. These methods first learn a class probability prediction model (typically using logistic regression) and then finely tune a threshold to get a classifier.

  2. Kar et al. (2013, 2016), Narasimhan and Agarwal (2013b) and Narasimhan et al. (2015) use stochastic gradient descent-style algorithms for optimizing non-decomposable performance measures for binary classification and quantification problems, such as the F-measure, Q-mean, and KL-divergence.

  3. Kar et al. (2014, 2015) and Narasimhan and Agarwal (2013a, b) focus on optimizing ranking loss functions such as (partial) area under the ROC curve (pAUC) and precision at the top (prec@k).

However, all the above works focus only on training linear models. Although this allows for simple algorithms and detailed analyses and theoretical guarantees, the approaches do not directly extend to deep networks (we will discuss the challenges shortly). Algorithms for deep learning which directly optimize non-decomposable performance measures are relatively less explored.

This can be attributed to the use of back-propagation style algorithms for training neural networks which require calculating model gradients with respect to a batch of training points. With decomposable loss functions, where it is possible to define the gradient of the performance measure with respect to a single data point, this can be cheaply done – the batch gradient is simply the mean of gradients with respect to individual data points in that batch. However, the same becomes infeasible for non-decomposable performance measures (for example, it is meaningless to talk about the F-measure of a classifier on a single data point).
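
To make the contrast concrete, here is a minimal PyTorch sketch (all names and the toy data are illustrative): for a decomposable loss, the gradient of the batch loss is simply the mean of per-sample gradients, whereas the F-measure of a batch cannot even be written as a sum of per-sample terms.

```python
import torch

# Toy linear scorer on a random mini-batch (all names illustrative).
w = torch.randn(5, requires_grad=True)
X = torch.randn(8, 5)
y = torch.randint(0, 2, (8,)).float() * 2 - 1             # labels in {-1, +1}

# Decomposable: the batch loss is a mean of per-sample losses, so its
# gradient is just the mean of the per-sample gradients.
scores = X @ w
loss = torch.nn.functional.softplus(-y * scores).mean()   # logistic loss
loss.backward()   # standard back-propagation applies directly

# Non-decomposable: the F-measure depends jointly on all predictions in
# the batch; there is no per-sample "F-measure of (x_i, y_i)" to average.
preds = (scores > 0).float()
pos = (y > 0).float()
tp = (preds * pos).sum()
f1 = 2 * tp / (preds.sum() + pos.sum())   # defined only for the whole batch
```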

2.2 Works on non-decomposable measures with deep models

We are aware of a few efforts at training deep networks with non-decomposable losses. Most of these works focus on ranking tasks.

  (1) Song et al. (2016) train neural networks for ranking tasks with mean average precision (MAP) as the performance measure, by showing that for nicely behaved ranking loss functions, the expected gradient of the loss function with respect to the network weights can be expressed in terms of loss functions such as the least-squares loss.

  (2) Eban et al. (2017) train neural networks for ranking objectives, e.g., area under the precision-recall curve and precision at a fixed recall rate.

  (3) Schäfer and Hüllermeier (2018) train neural networks for label ranking tasks based on the Plackett–Luce model where label features are available.

3 Our contributions

Our work advances the state of the art in training neural networks on a wide variety of non-decomposable performance measures for classification and quantification tasks. This complements existing works on ranking tasks.

  (1) We show how to train neural networks directly using (smoothed versions of) performance measures that are concave, pseudo-linear, or nested concave functions, including the F-measure, Q-mean, KL-divergence and many others.

  (2) We offer formal stabilization guarantees for all our algorithms, i.e., we guarantee that our algorithms rapidly approach a stable point where the effective gradients of the algorithm vanish.

  (3) Our methods offer far superior performance to traditional cross-entropy based training routines: on an F-measure maximization task on the benchmark dataset a9a, our method achieves an F-measure of around 0.68 in fewer than 10 mini-batch iterations, whereas traditional cross-entropy based training takes more than 80 iterations to reach similar performance levels.

  (4) Our methods also outperform popular techniques used to deal with label imbalance, such as cost-weighted classification, plug-in methods, and loss-augmented inference methods using structural surrogates: on a min-TPR/TNR maximization task on the benchmark dataset IJCNN, loss-augmented inference is only able to offer a performance level of 0.55, whereas our technique reaches performance over 0.95 in very few iterations.

  (5) We apply our techniques to a Twitter sentiment analysis challenge task using an end-to-end recurrent neural network with attention and achieve near-perfect quantification scores using substantially fewer training iterations than standard cross-entropy based training.

Our algorithms are readily adapted to neural architectures such as multi-layered perceptrons and recurrent networks, as well as easily integrated into popular symbolic gradient frameworks such as TensorFlow, PyTorch, and Theano.

3.1 Our contributions in the context of related work in deep models

  (1) Song et al. (2016), Eban et al. (2017) and Schäfer and Hüllermeier (2018) focus on ranking measures, whereas our work addresses binary classification and quantification (class-ratio estimation) measures.

  (2) Song et al. (2016) use loss-augmented inference techniques (Joachims et al. 2009; Tsochantaridis et al. 2005) which are very well established in machine learning. We apply the same techniques to our performance measures (see the STRUCT-ANN benchmark in our experiments) and show that our precise primal-dual techniques far outperform methods based on loss-augmented inference.

  (3) Although Eban et al. (2017) do consider the F-measure, which we also study, they do not report any experiments with it. A possible reason is that their algorithm requires solving a constrained optimization problem, which is challenging over deep networks. Our methods generalize to a large number of widely used measures, e.g., H-mean, G-mean, Jaccard coefficient, and Q-measure, which Eban et al. (2017) do not claim to handle.

  (4) Song et al. (2016), Eban et al. (2017) and Schäfer and Hüllermeier (2018) do not offer convergence guarantees for their proposed algorithms, whereas we do offer stabilization guarantees for our methods.

3.2 Our contributions in the context of related work in linear models

We note that our methods do adapt techniques that were earlier proposed for training linear models, such as in Narasimhan et al. (2015). However, we stress that this adaptation is neither trivial, nor immediate, and constitutes an independent contribution.

  (1) Previous works, such as Narasimhan et al. (2015), consider only linear models, which lead to convex problems. A naïve and direct application of existing techniques to deep networks yields poor experimental results. For example, we could obtain good accuracies on the challenging F-measure only after we adapted existing methods to carefully use pretraining techniques.

  (2) Previous works consider only convex surrogate functions, since they require the overall training problem to remain convex. We found this to give poor results with deep neural networks, and obtained much better performance using (non-convex) surrogates such as the sigmoid.

  (3) Despite working with non-linear models such as deep networks, and using non-convex surrogates such as the sigmoid, we are still able to provide stabilization guarantees for our algorithms. In fact, for the F-measure and other pseudo-linear measures, our methods converge to a stationary point of the (smoothed) performance measure. Existing theoretical analyses for linear methods rely on convexity and cannot give such results.

4 Problem setting

We will consider binary classification and quantification problems in this work. Let \(\mathcal {X}\subset \mathbb {R}^d\) be the space of feature vectors and \(\mathcal {Y}= \left\{ {-\,1,+\,1}\right\} \) be the label set. The training data set \(S = (\mathbf {x}_i,y_i)_{i = 1,\ldots ,n}\) shall be sampled i.i.d. from some fixed but unknown distribution \(\mathcal {D}\) over \(\mathcal {X}\times \mathcal {Y}\). The proportion of positives in the population and sample S will be denoted by \(p = \mathop {\mathbb {P}}\limits _{(\mathbf {x},y)\sim \mathcal {D}}[y = +1]\) and \(\hat{p}_S\) respectively.

In sharp contrast to most previous works on non-decomposable performance measures that consider only linear models, we concentrate on non-linear models, especially those induced by neural networks. Let us fix the neural architecture (layers, activations, connectivity) and let \(\mathcal {W}\) denote the space of all models (weights on the network edges). We will train a neural model, whose edge weights are indexed by \(\mathbf {w}\in \mathcal {W}\), to assign a score to every data point \(\mathbf {x}\in \mathcal {X}\) (which can be converted into a binary label by thresholding, into class probability estimates by applying a transfer function, etc.). Linear models assign a score by simply computing \(\left\langle {\mathbf {w}},{\mathbf {x}}\right\rangle \).

We will use a more general notation \(f(\mathbf {x};\mathbf {w})\) (or equivalently \(f_\mathbf {w}(\mathbf {x})\) as a shorthand) to denote the score given to the data point \(\mathbf {x}\) by the neural model indexed by the weights \(\mathbf {w}\). The function f can be seen as encoding all the neural connections and activations. We stress that the function f is neither convex nor concave. This lack of structure in the scoring function precludes a large body of work in linear multivariate optimization and quantification from being applied to deep models.

Table 1 List of performance measures \(\varPsi (P,N)\). We use the shorthand notation \(P \equiv \text {TPR}, N \equiv \text {TNR} \). \(p, n\) denote the proportions of positives and negatives in the training dataset

Performance measures We consider performance measures that can be expressed in terms of the true positive rate (TPR—fraction of positively labeled points correctly labeled as positive) and true negative rate (TNR—fraction of negatively labeled points correctly labeled as negative) of the model. We will consider three general classes of performance measures described below. In our experiments, we will present results on a selection of these performance measures listed in Table 1. Below we recall the definitions of these families from Kar et al. (2016) and Narasimhan et al. (2015).

Pseudo-linear performance measures These measures can be written as a ratio of two linear functions of the TPR and TNR values, i.e., they have a fractional-linear link function. More specifically, given coefficients \(\mathbf {a},\mathbf {b}\in \mathbb {R}^3\),

$$\begin{aligned} \mathcal {P}_{(\mathbf {a},\mathbf {b})}(\mathbf {w}) = \frac{a_0 + a_1\cdot \text {TPR} (\mathbf {w}) + a_2\cdot \text {TNR} (\mathbf {w})}{b_0 + b_1\cdot \text {TPR} (\mathbf {w}) + b_2\cdot \text {TNR} (\mathbf {w})}. \end{aligned}$$

The popularly used F-measure (Manning et al. 2008) is actually a pseudo-linear performance measure in terms of the TPR, TNR values of a model although it is more commonly represented as the harmonic mean of precision and recall. Other members include the Jaccard coefficient and the Gower–Legendre measure.
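
For concreteness, writing \(n = 1-p\) for the proportion of negatives, the F1 score can be written in this fractional-linear form as

$$\begin{aligned} F_1(\mathbf {w}) = \frac{2p\cdot \text {TPR} (\mathbf {w})}{1 + p\cdot \text {TPR} (\mathbf {w}) - n\cdot \text {TNR} (\mathbf {w})}, \end{aligned}$$

i.e., \(\mathbf {a}= (0, 2p, 0)\) and \(\mathbf {b}= (1, p, -n)\). As a sanity check, the trivial classifier that labels every point positive has \(\text {TPR} = 1\) and \(\text {TNR} = 0\), giving \(F_1 = 2p/(1+p)\), which is small precisely when positives are rare.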

Concave performance measures These measures can be written as a concave function of the TPR and TNR values:

$$\begin{aligned} \mathcal {P}_\varPsi (\mathbf {w}) = \varPsi \left( {\text {TPR} (\mathbf {w}),\text {TNR} (\mathbf {w})}\right) \end{aligned}$$

for some concave link/wrapper function \(\varPsi : \mathbb {R}^2 \rightarrow \mathbb {R}\). These measures are popular in cost-sensitive classification in cases with severe label imbalance, for example in detection theory (Vincent 1994). A commonly used member of this family is the so-called Min-function, which assigns the value \(\min \left\{ {\text {TPR} (\mathbf {w}),\text {TNR} (\mathbf {w})}\right\} \) to a model \(\mathbf {w}\). Note that this compels the model to pay equal attention to both classes. Other examples include the Q-mean and H-mean measures.
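
For instance, the Q-mean measure used in our experiments is commonly defined (cf. Narasimhan et al. 2015) as

$$\begin{aligned} \mathcal {P}_{\text {QMean}}(\mathbf {w}) = 1 - \sqrt{\frac{\left( {1-\text {TPR} (\mathbf {w})}\right) ^2 + \left( {1-\text {TNR} (\mathbf {w})}\right) ^2}{2}}, \end{aligned}$$

which is concave in \((\text {TPR}, \text {TNR})\) (it is one minus a norm of an affine map) and, like the Min-function, penalizes a model for neglecting either class.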

Nested concave performance measures Recent works, e.g., Barranquero et al. (2015) and Kar et al. (2016) in areas such as quantification and class ratio estimation, have brought focus on performance measures that can be written as concave combinations of concave performance measures. More formally, given three concave functions \(\varPsi , \zeta _1, \zeta _2: \mathbb {R}^2 \rightarrow \mathbb {R}\), we define a performance measure

$$\begin{aligned} \mathcal {P}_{(\varPsi , \zeta _1, \zeta _2)}(\mathbf {w}) = \varPsi (\zeta _1(\mathbf {w}),\zeta _2(\mathbf {w})), \end{aligned}$$

where \(\zeta _i(\mathbf {w}) := \zeta _i(\text {TPR} (\mathbf {w}),\text {TNR} (\mathbf {w})), i = 1,2\). A popular measure for quantification tasks is the Kullback–Leibler divergence (KLD) (Barranquero et al. 2015; Esuli and Sebastiani 2015; Gao and Sebastiani 2015; Kar et al. 2016), which can be shown to be a sum of concave functions of the TPR and TNR, as we spell out below. If \(\mathbf {p}\in \mathbb {R}^2\) is the vector of true class priors for a binary classification task and \({\hat{\mathbf {p}}}\) an estimate thereof, then

$$\begin{aligned} \text {KLD} (\mathbf {p},\hat{\mathbf {p}})&= \sum _{y\in \mathcal {Y}}p(y)\log \frac{p(y)}{\hat{p}(y)} \end{aligned}$$
(1)

\(\text {KLD} (\mathbf {p},\hat{\mathbf {p}}) = 0\) indicates perfect quantification.
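
To see why the KLD fits this family, note that the fraction of points a classifier predicts as positive can be written, with \(n = 1-p\), as

$$\begin{aligned} \hat{p}(\mathbf {w}) = p\cdot \text {TPR} (\mathbf {w}) + n\cdot \left( {1-\text {TNR} (\mathbf {w})}\right) , \end{aligned}$$

which is affine in \((\text {TPR}, \text {TNR})\). Each term of (1) is then a convex function of \(\hat{p}(\mathbf {w})\), so the negated KLD, which is the quantity we maximize, is a sum of concave functions of the TPR and TNR.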

There are several other performance measures that our techniques can handle but which we do not discuss here due to lack of space. These include measures for class-imbalanced classification such as H-mean, G-mean, Jaccard coefficient (Narasimhan et al. 2015), as well as quantification measures such as Q-measure, NSS and CQB (Kar et al. 2016).

Smoothing the performance measures Since TPR and TNR are count-based measures, they are unsuitable for numerical optimization algorithms. Thus, it is very common in learning and optimization literature to use surrogate reward/loss functions to stand in for them. Some approaches (Narasimhan et al. 2015) use point/decomposable surrogates whereas others (Narasimhan and Agarwal 2013a; Kar et al. 2015; Song et al. 2016) use non-decomposable surrogates.

We will use point surrogate reward functions to replace TPR, TNR in our objectives. A reward function r assigns a reward \(r(\hat{y},y)\) when the true label is \(y \in \mathcal {Y}\) but the prediction is \(\hat{y} \in \mathbb {R}\). Given a reward function \(r: \mathbb {R}\times \mathcal {Y}\rightarrow \mathbb {R}\), a model \(\mathbf {w}\in \mathcal {W}\), data point \((\mathbf {x},y) \in \mathcal {X}\times \mathcal {Y}\), and scoring function f, we will use

$$\begin{aligned} r^+(\mathbf {w}; \mathbf {x}, y)&= \frac{1}{p}\cdot r(f(\mathbf {x};\mathbf {w}), y)\cdot \mathbb {I}\{{y = 1}\}\\ r^-(\mathbf {w}; \mathbf {x}, y)&= \frac{1}{1-p}\cdot r(f(\mathbf {x};\mathbf {w}), y)\cdot \mathbb {I}\{{y=-1}\}, \end{aligned}$$

to calculate rewards on positive and negative points (\(\mathbb {I}\{{\cdot }\}\) denotes the indicator function). For the purpose of training, we may replace p with \(\hat{p}_S\), i.e., the proportion of positives in the data sample.

We will use the shorthand \(P(\mathbf {w})=\mathbb {E}\left[ {r^+(\mathbf {w}; \mathbf {x}, y)}\right] \) to denote the population average of the reward function and, given a sample of n data points \(S = (\mathbf {x}_i,y_i)_{i=1,\ldots ,n}\), denote the sample average as \(\hat{P}_S(\mathbf {w}) = \frac{1}{n}\sum _{i=1}^n r^+(\mathbf {w}; \mathbf {x}_i, y_i)\); \(N(\mathbf {w})\) and \(\hat{N}_S(\mathbf {w})\) are defined similarly. We will use \(\hat{P}_S(\mathbf {w})\) and \(\hat{N}_S(\mathbf {w})\) in place of TPR and TNR respectively while executing our algorithms.

Unlike previous work (Kar et al. 2016; Narasimhan et al. 2015), we will not restrict ourselves to concave surrogate reward functions. In particular, we will utilize the sigmoidal reward \(r_{\text {sigmoid}}(\hat{y}, y) = (1+\exp (-y\cdot \hat{y}))^{-1}\), which is widely used as an activation function in neural networks but is non-concave. The reason for this is that the sigmoid resembles the misclassification loss underlying TPR and TNR much more closely than the hinge or logistic losses do, and it gave better performance in our experiments. Many existing works preferred losses such as the hinge and logistic losses merely due to their convexity.
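
To make the preceding definitions concrete, here is a minimal PyTorch helper (the names are ours, not part of any library) that computes the smoothed surrogates \(\hat{P}_S, \hat{N}_S\) from a mini-batch of scores using the sigmoidal reward; we reuse it in the algorithm sketches of Sect. 5.

```python
import torch

def smoothed_rates(scores, labels, p_hat):
    """Sigmoid-smoothed surrogates for TPR and TNR on a batch.

    scores: real-valued outputs f(x; w), shape (n,)
    labels: labels in {-1, +1}, shape (n,)
    p_hat : proportion of positives in the sample
    """
    rewards = torch.sigmoid(labels * scores)        # r_sigmoid(f(x; w), y)
    pos = (labels > 0).float()
    neg = 1.0 - pos
    P_hat = (rewards * pos).mean() / p_hat          # smoothed TPR surrogate
    N_hat = (rewards * neg).mean() / (1.0 - p_hat)  # smoothed TNR surrogate
    return P_hat, N_hat
```

Since every operation here is differentiable, back-propagating through `P_hat` and `N_hat` yields exactly the gradients \(\nabla _\mathbf {w}\hat{P}_S, \nabla _\mathbf {w}\hat{N}_S\) that our algorithms require.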

5 Deep optimization algorithms

The task of training deep models directly for non-decomposable performance measures is challenging due to several reasons: (1) these measures are non-decomposable and do not lend themselves to straightforward training methods such as gradient descent or back-propagation, (2) deep models do not offer the convenience of convexity, and (3) existing methods for optimizing such measures, e.g., Kar et al. (2016) and Narasimhan et al. (2015) fail to apply directly to deep models. We will see in Sect. 6 that a direct and naive application of traditional techniques yields poor results.

We will show how to overcome these challenges to arrive at scalable methods for training deep networks directly on (smoothed) non-decomposable measures. Let the procedure \(\text {NN-init} (d_\text {in},d_\text {out},\text {conf})\) initialize a neural network with \(d_\text {in}\) input nodes, \(d_\text {out}\) output nodes, and internal configuration (hidden layers, number of internal nodes, connectivity) specified by \(\text {conf}\). Please refer to Sect. 6 for details.

Intuition behind our algorithms The basic intuition behind our algorithms is simple but useful, and has been used by previous works (Kar et al. 2016; Narasimhan et al. 2015) as well. Training a model with respect to decomposable performance measures, such as cross-entropy or cost-weighted classification objectives, is relatively simple and can be accomplished using classical back-propagation. We therefore convert the task of training with respect to a non-decomposable performance measure, say the F-measure, into a sequence of cost-weighted classification problems, with the costs recalculated at each time step using methods specific to the performance measure at hand.

5.1 DAME: a deep learning technique for pseudo-linear performance measures

We present DAME (Algorithm 1), an algorithm for training deep models on pseudo-linear performance measures such as F-measure. We recall that although Eban et al. (2017) discuss F-measure optimization on deep models, they do not implement their algorithm or report experimental results with it, possibly since their method involves expensive constrained optimization problems involving deep networks. It will be our objective to maximize the following performance measure

$$\begin{aligned} \mathcal {P}_{(\mathbf {a},\mathbf {b}),S}(\mathbf {w}) = \frac{a_0 + a_1\cdot \hat{P}_S (\mathbf {w}) + a_2\cdot \hat{N}_S (\mathbf {w})}{b_0 + b_1\cdot \hat{P}_S (\mathbf {w}) + b_2\cdot \hat{N}_S (\mathbf {w})}, \end{aligned}$$

where \(\hat{P}_S (\mathbf {w}), \hat{N}_S (\mathbf {w})\) are the smoothed sample versions of TPR, TNR. Define \(\mathcal {P}_{\mathbf {a},S}(\mathbf {w}) := a_0 + a_1\cdot \hat{P}_S (\mathbf {w}) + a_2\cdot \hat{N}_S (\mathbf {w})\) and \(\mathcal {P}_{\mathbf {b},S}(\mathbf {w}) := b_0 + b_1\cdot \hat{P}_S (\mathbf {w}) + b_2\cdot \hat{N}_S (\mathbf {w})\). Our discussion will benefit from the notion of a valuation function from Narasimhan et al. (2015).

Definition 1

(Valuation function) Given coefficients \(a_0,a_1,a_2,b_0,b_1,b_2\), and a level \(v > 0\), define the valuation of a model at level v on training set S as

$$\begin{aligned} V_S(\mathbf {w},v) = \mathcal {P}_{\mathbf {a},S}(\mathbf {w}) - v\cdot \mathcal {P}_{\mathbf {b},S}(\mathbf {w}) = (a_0 - vb_0) + (a_1-vb_1)\cdot \hat{P}_S (\mathbf {w}) + (a_2-vb_2)\cdot \hat{N}_S (\mathbf {w}) \end{aligned}$$

Intuition behind DAME Notice that a model \(\mathbf {w}\) has performance at least \(v\) for some \(v > 0\), i.e., \(\mathcal {P}_{(\mathbf {a},\mathbf {b})}(\mathbf {w}) \ge v\), if and only if \(\mathcal {P}_\mathbf {a}(\mathbf {w}) \ge v\cdot \mathcal {P}_\mathbf {b}(\mathbf {w})\) (the F-measure and other such performance measures always satisfy \(\mathcal {P}_\mathbf {b}(\mathbf {w}) > 0\)). This is equivalent to \((a_0 - vb_0) + (a_1-vb_1)\cdot \hat{P}_S (\mathbf {w}) + (a_2-vb_2)\cdot \hat{N}_S (\mathbf {w}) \ge 0\), i.e., to \(V_S(\mathbf {w},v) \ge 0\): the model has a non-negative valuation at level v.

Thus, one way to maximize \(\mathcal {P}_{(\mathbf {a},\mathbf {b})}\) is to simply check, for various values of v, whether there exists any model \(\mathbf {w}\) such that \(V_S(\mathbf {w},v) \ge 0\). By the above argument, the largest value of v for which the answer is yes must be the maximum achievable level of performance with respect to \(\mathcal {P}_{(\mathbf {a},\mathbf {b}),S}\). The work of Koyejo et al. (2014) does something similar. However, this may be very expensive since it is not clear which values of v to query, nor how to decide whether the answer is yes or no.

DAME takes this intuition and turns it into a scalable algorithm. At every time step, it looks at the level \(v^t = \mathcal {P}_{(\mathbf {a},\mathbf {b}),S}(\mathbf {w}^t)\), since it is assured that at least one model, namely \(\mathbf {w}^t\), satisfies \(V_S(\mathbf {w},v^t) \ge 0\). It then queries whether some other model \(\mathbf {w}'\) can beat \(\mathbf {w}^t\) in terms of valuation at level \(v^t\). Note that the search for such a model \(\mathbf {w}'\) requires us to solve

$$\begin{aligned} \mathbf {w}' = \arg \max _{\mathbf {w}\in \mathcal {W}}\ V_S(\mathbf {w},v^t) = \arg \max _{\mathbf {w}\in \mathcal {W}}\ (a_1-v^tb_1)\cdot \hat{P}_S (\mathbf {w}) + (a_2-v^tb_2)\cdot \hat{N}_S (\mathbf {w}), \end{aligned}$$

which is simply a cost-weighted classification problem with weights \((a_1-v^tb_1)\) and \((a_2-v^tb_2)\) on positives and negatives respectively. Having obtained an approximate maximizer of the above problem, say \(\mathbf {w}^{t+1}\), DAME resets the level to \(v^{t+1} = \mathcal {P}_{(\mathbf {a},\mathbf {b}),S}(\mathbf {w}^{t+1})\) and challenges a model to maximize \(V_S(\mathbf {w},v^{t+1})\); note that the costs change for this new challenge. However, this strategy, which was implemented by Narasimhan et al. (2015), performs poorly on deep networks. The complex nature of pseudo-linear performance measures, which are neither convex nor concave, seems to make it more challenging to train deep models by solving the intermediate cost-weighted problems effectively.

To overcome this, DAME first performs a pretraining step where the entire network is trained on cost-weighted classification losses using standard mini-batch SGD. Let \((\mathbf {w}_1,\mathbf {w}_2)\) denote a stacking of the neural networks described by the models \(\mathbf {w}_1\) and \(\mathbf {w}_2\). More specifically, \(\mathbf {w}_2\) denotes a network with input dimensionality \(d_\text {in}\) and output dimensionality \(d_\text {int}\), whereas \(\mathbf {w}_1\) denotes a network with input dimensionality \(d_\text {int}\) and output dimensionality \(d_\text {out}\).

In the pretraining step, both \(\mathbf {w}_1,\mathbf {w}_2\) are trained on a cost-weighted classification problem where the cost of each class is set naïvely to the inverse of its prevalence in the dataset, i.e., the cost on positives is \(1/\hat{p}_S\). After this, a fine-tuning step is performed where only the upper layers of the network, i.e., \(\mathbf {w}_1\), are trained to optimize the F-measure. To save time during the fine-tuning phase, instead of solving the intermediate cost-weighted classification problems completely, DAME simply executes a few iterations of back-propagation on each cost-weighted problem. This two-stage procedure offers much better performance than training with respect to valuation functions right from the start.
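
The fine-tuning loop admits a compact sketch. Below is a minimal PyTorch rendering under stated assumptions: `smoothed_rates` is the helper sketched at the end of Sect. 4, labels are \(\pm 1\), the coefficient vectors `a`, `b` come from Table 1, mini-batches stand in for the full sample S, and for brevity we update all parameters even though DAME proper freezes the lower network \(\mathbf {w}_2\) and fine-tunes only \(\mathbf {w}_1\) (e.g., by passing only those parameters to the optimizer).

```python
import torch

def dame_finetune(model, loader, a, b, p_hat, inner_steps=5, lr=1e-3):
    """Sketch of DAME's fine-tuning phase for a pseudo-linear measure."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    v = 0.0                                    # current level v^t
    for X, y in loader:
        for _ in range(inner_steps):           # a few back-prop steps on the
            scores = model(X).squeeze(-1)      # current cost-weighted problem
            P, N = smoothed_rates(scores, y, p_hat)
            V = (a[0] - v * b[0]) + (a[1] - v * b[1]) * P + (a[2] - v * b[2]) * N
            opt.zero_grad()
            (-V).backward()                    # ascend the valuation V_S(w, v^t)
            opt.step()
        with torch.no_grad():                  # reset the level:
            P, N = smoothed_rates(model(X).squeeze(-1), y, p_hat)
            num = a[0] + a[1] * P + a[2] * N   # v^{t+1} = P_(a,b),S(w^{t+1})
            v = (num / (b[0] + b[1] * P + b[2] * N)).item()
    return model
```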

Assumptions for theoretical convergence results We are able to show a first order stationarity guarantee for DAME under the following assumptions:

Algorithm 1 DAME
  (1) The negated valuation functions are L-strongly smooth functions (see the definition below) of the upper model \(\mathbf {w}_1\). Since the sigmoidal reward function is \(\frac{1}{2}\)-strongly smooth, this holds for some L that depends on the size of the network.

  (2) For some \(M \ge m > 0\), the performance measure satisfies \(\mathcal {P}_{\mathbf {a}}(\mathbf {w}) \le M\) and \(\mathcal {P}_{\mathbf {b}}(\mathbf {w}) \ge m\) for all \(\mathbf {w}\in \mathcal {W}\). We note that these assumptions are standard (Kar et al. 2016; Narasimhan et al. 2015) and readily satisfied by the F-measure, Jaccard coefficient, etc., for which we have \(m, M = \varTheta (1)\) (see Narasimhan et al. 2015). We will denote \(\kappa = 1 + M/m\).

  (3) The norm of the gradient of the reward function with respect to any single data point is bounded, i.e., \(\max \left\{ {\left\| {\nabla _\mathbf {w}r^+(\mathbf {w};\mathbf {x},y)}\right\| _2, \left\| {\nabla _\mathbf {w}r^-(\mathbf {w};\mathbf {x},y)}\right\| _2}\right\} \le r\) for any model \(\mathbf {w}\) and any data point \((\mathbf {x},y)\). This can easily be ensured by controlling the weights of the network and normalizing the data features \(\mathbf {x}\).

Definition 2

(Strongly smooth function) We call a function \(f: \mathbb {R}^d \rightarrow \mathbb {R}\) L-strongly smooth if for all \(\mathbf {x},\mathbf {y}\in \mathbb {R}^d\), we have

$$\begin{aligned} f(\mathbf {x}) \le f(\mathbf {y}) + \left\langle {\nabla f(\mathbf {y})},{\mathbf {x}- \mathbf {y}}\right\rangle + \frac{L}{2}\left\| {\mathbf {x}- \mathbf {y}}\right\| _2^2 \end{aligned}$$

Theorem 1

If executed with a uniform step length \(\eta < \frac{2}{L\kappa }\) and a batch size B, DAME discovers an \(\epsilon + \mathcal {O}\left( {\frac{r}{m\sqrt{B}}}\right) \)-stable model within \(\mathcal {O}\left( {\frac{1}{\epsilon ^2}}\right) \) inner iterations with high probability. More specifically, within \(T \le \frac{\kappa ^2}{m}\frac{2}{\eta \left( {2 - L\kappa \eta }\right) \epsilon ^2}\) iterations, DAME identifies a model \(\mathbf {w}^{T,0}_1\) such that \(\left\| {\nabla _{\mathbf {w}}\mathcal {P}_{(\mathbf {a},\mathbf {b})}(\mathbf {w}^{T,0}_1)}\right\| _2 \le \epsilon + \mathcal {O}\left( {\frac{r}{m\sqrt{B}}}\right) \). If a mini-batch is not used and \(S_{t,i} = \tilde{S}\) for all time steps \(t, i\), then we are (deterministically) assured of an \(\epsilon \)-stable model, i.e., that \(\left\| {\nabla _{\mathbf {w}}\mathcal {P}_{(\mathbf {a},\mathbf {b})}(\mathbf {w}^{T,0}_1)}\right\| _2 \le \epsilon \) instead.

We present the proof in Appendix A. Since the pre-training phase enjoys local convergence guarantees by standard arguments, the argument is presented only for the fine-tuning phase and the lower network is omitted in the analysis.

5.2 DUPLE: a deep learning technique for concave performance measures

We present DUPLE (Algorithm 2), a scalable stochastic mini-batch primal dual algorithm for training deep models with concave performance measures. It will be our objective to maximize the following performance measure

$$\begin{aligned} \mathcal {P}_{\varPsi ,S}(\mathbf {w}) = \varPsi (\hat{P}_S (\mathbf {w}), \hat{N}_S (\mathbf {w})), \end{aligned}$$

where \(\hat{P}_S (\mathbf {w}), \hat{N}_S (\mathbf {w})\) are smoothed sample versions of TPR, TNR and \(\varPsi : \mathbb {R}^2 \rightarrow \mathbb {R}\) is a concave link function. We shall find it convenient to define the (concave) Fenchel conjugate of \(\varPsi \) in order to present the intuition behind our algorithms. For any concave function \(\varPsi : \mathbb {R}^2 \rightarrow \mathbb {R}\) and \(\alpha , \beta \in \mathbb {R}\), define

$$\begin{aligned} \varPsi ^*(\alpha , \beta ) = \inf _{u, v \in \mathbb {R}} \left\{ {\alpha u + \beta v - \varPsi (u, v)}\right\} . \end{aligned}$$
(2)

Intuition behind DUPLE The reason for defining the Fenchel conjugate is that, by the Fenchel–Moreau theorem, every concave, upper semi-continuous function satisfies, for any \(u, v \in \mathbb {R}\),

$$\begin{aligned} \varPsi (u, v) = \inf _{\alpha , \beta \in \mathbb {R}} \left\{ {\alpha u + \beta v - \varPsi ^*(\alpha , \beta )}\right\} . \end{aligned}$$
(3)

This reformulation is advantageous since an application of Danskin’s theorem now allows us to take (sub) gradients with respect to \(\varPsi \). More formally, Danskin’s theorem tells us that if \(\tilde{\alpha },\tilde{\beta }\) are the minimizer values in (3) at a point \(u^0,v^0\), then \(\nabla _{u,v}\varPsi (u,v)|_{(u^0,v^0)} \ni (\tilde{\alpha },\tilde{\beta })\). This gives us a way to perform back-propagation-based training of neural networks with respect to concave performance measures: at every time step we can consider the current model \(\mathbf {w}^t\), evaluate \(\hat{P}_S (\mathbf {w}^t), \hat{N}_S (\mathbf {w}^t)\) using a forward pass over the entire training set S, solve the following problem

$$\begin{aligned} (\alpha ',\beta ') = \underset{\alpha , \beta \in \mathbb {R}}{\arg \inf }\left\{ {\alpha \cdot \hat{P}_S (\mathbf {w}^t) + \beta \cdot \hat{N}_S (\mathbf {w}^t) - \varPsi ^*(\alpha , \beta )}\right\} , \end{aligned}$$

and then apply chain rule to obtain

$$\begin{aligned} \nabla _\mathbf {w}\varPsi (\hat{P}_S (\mathbf {w}), \hat{N}_S (\mathbf {w}))\big |_{\mathbf {w}^t}&= \alpha '\cdot \nabla _\mathbf {w}\hat{P}_S (\mathbf {w})\big |_{\mathbf {w}^t} + \beta '\cdot \nabla _\mathbf {w}\hat{N}_S (\mathbf {w})\big |_{\mathbf {w}^t}\\&= \nabla _\mathbf {w}\left( {\alpha '\cdot \hat{P}_S (\mathbf {w}) + \beta '\cdot \hat{N}_S (\mathbf {w})}\right) \big |_{\mathbf {w}^t}. \end{aligned}$$

Notice that this is identical to taking a gradient step with respect to a cost-weighted classification problem with cost \(\alpha '\) on positives and cost \(\beta '\) on negatives. However, the process of discovering the optimal “dual variables” \((\alpha ',\beta ')\) this way is expensive since it requires a pass over the entire dataset S. DUPLE offers a much faster way to discover dual variables that, although not optimal w.r.t. \(\mathbf {w}^t\), nevertheless promote progress. The advantage gained here is that we can find these variables using a single mini-batch rather than going over the entire dataset.
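
As a concrete example, for the Min-function \(\varPsi (u,v) = \min \left\{ {u,v}\right\} \), a direct computation gives \(\varPsi ^*(\alpha ,\beta ) = 0\) if \(\alpha , \beta \ge 0\) and \(\alpha + \beta = 1\), and \(-\infty \) otherwise, so the optimal duals place all of the cost on whichever class currently has the smaller (smoothed) reward:

$$\begin{aligned} (\alpha ',\beta ') = {\left\{ \begin{array}{ll} (1,0) &{} \text {if } \hat{P}_S (\mathbf {w}^t) \le \hat{N}_S (\mathbf {w}^t),\\ (0,1) &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

Each cost-weighted step thus pushes the model to improve whichever of the TPR and TNR is currently lagging, exactly the behavior one would want from the Min measure.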

The way DUPLE does this is by maintaining a proxy for the empirical reward values \(\hat{P}_S (\mathbf {w}^t), \hat{N}_S (\mathbf {w}^t)\) such that this proxy is very cheap to update. At every time step, we have a model \(\mathbf {w}^t\) and a mini-batch \(S_t\) being used to train further. DUPLE stores the values \(\hat{P}_{S_t}(\mathbf {w}^t), \hat{N}_{S_t}(\mathbf {w}^t)\) inside a running average. At any point, this running average is used as a cheap proxy for \(\hat{P}_S (\mathbf {w}^t), \hat{N}_S (\mathbf {w}^t)\).
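
Putting the pieces together, a minimal sketch of the DUPLE loop follows (PyTorch, illustrative names): `smoothed_rates` is the helper from Sect. 4, and `dual_step` is an assumed helper returning the minimizing \((\alpha ,\beta )\) in (3) at the proxy point, i.e., a supergradient of \(\varPsi \) there, such as the indicator rule above for the Min-function.

```python
import torch

def duple(model, loader, dual_step, p_hat, lr=1e-3):
    """Sketch of DUPLE's primal-dual loop for a concave measure Psi."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    P_avg, N_avg, t = 0.5, 0.5, 0              # cheap running proxies
    for X, y in loader:
        scores = model(X).squeeze(-1)
        P, N = smoothed_rates(scores, y, p_hat)
        t += 1                                 # update proxies with the
        P_avg += (P.item() - P_avg) / t        # mini-batch reward values
        N_avg += (N.item() - N_avg) / t
        alpha, beta = dual_step(P_avg, N_avg)  # dual update at the proxies
        opt.zero_grad()
        (-(alpha * P + beta * N)).backward()   # ascend g(w; S_t, alpha, beta)
        opt.step()
    return model
```

For a smooth link such as the Q-mean, `dual_step` simply returns the gradient of \(\varPsi \) at the proxy point \((\bar{P},\bar{N})\).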

Algorithm 2 DUPLE

We note that DUPLE draws upon the SPADE algorithm proposed in Narasimhan et al. (2015). However, its application to deep models requires non-trivial extensions.

  (1) SPADE performs SGD steps with individual data points since gradient updates are rapid with linear models. Doing so with neural models is expensive.

  (2) The theoretical analysis of SPADE assumes that the scoring and reward functions being used are concave functions of \(\mathbf {w}\). As noted in Sect. 4, for neural models, the scoring and reward functions are both non-concave.

  (3) Deep learning frameworks are highly optimized for gradient computations under the assumption that the objective function with respect to which gradients are computed is static across iterations (although some recent frameworks do not have this drawback). SPADE, while applying the chain rule, ends up considering a different cost-weighted objective at each step.

DUPLE addresses the above issues and makes crucial design changes that make it highly optimized for use with deep networks.

  (1) DUPLE operates using cheap proxies that at no point require a pass over the entire dataset or gradient computations on individual data points. Gradients are computed only over mini-batches, which is a highly optimized operation in deep learning frameworks as long as batch sizes are not too large. We found this to also improve the stability of the algorithm.

  (2) At every time step, DUPLE requests gradients with respect to a cost-weighted objective function of the form \(g(\mathbf {w}; S, \alpha , \beta ) = \alpha \cdot \hat{P}_S(\mathbf {w}) + \beta \cdot \hat{N}_S(\mathbf {w})\). We exploit the symbolic computation capabilities offered by frameworks such as Theano (Bergstra et al. 2010) to allow the scalars \(\alpha , \beta \) to be updated dynamically, so the network can be trained efficiently on a different objective function at each time step.

  (3) Our analysis of DUPLE makes absolutely no assumptions on the convexity/concavity of the reward and scoring functions. It only requires both functions \(r^+,r^-\) to be differentiable almost everywhere. Thus, DUPLE assumes only the bare minimum needed to take gradients.

We show the following convergence guarantee for DUPLE (see Appendix B) assuming that the negated surrogate reward function \(-r(f(\mathbf {x};\mathbf {w}),y)\) is \(L'\)-strongly smooth. This is satisfied by the sigmoidal reward function for some \(L'\) that depends on the size of the network. We nowhere assume the surrogate reward functions to be concave or convex. We will use the shorthand \(\nabla ^t = \nabla _\mathbf {w}g(\mathbf {w}^t; S_t, \alpha ^t, \beta ^t)\) and \(F(\mathbf {w}^t,\varvec{\alpha }^t) = g(\mathbf {w}^t; S_t, \alpha ^t, \beta ^t)\). The following result assures us that the DUPLE procedure will stabilize rapidly and not oscillate indefinitely.

Theorem 2

Consider a concave performance measure defined using a link function \(\varPsi \) that is concave and \(L'\)-smooth. If executed with a uniform step length satisfying \(\eta < \frac{2}{L'}\), DUPLE \(\epsilon \)-stabilizes within \(\tilde{\mathcal {O}}\left( {\frac{1}{\epsilon ^2}}\right) \) iterations. More specifically, within T iterations, DUPLE identifies a model \(\mathbf {w}^t\) such that \(\left\| {\nabla ^t}\right\| _2 \le \mathcal {O}\left( {\sqrt{L'\frac{\log T}{T}}}\right) \).

5.3 DENIM: deep learning with nested concave performance measures

We extend the DUPLE algorithm to performance measures that involve a nesting of concave functions. To reiterate, the KLD performance measure, which is used extensively for quantification, falls into this category. These measures are challenging to optimize using DUPLE since their nested structure prevents a closed-form solution for the Fenchel conjugates.

Algorithm 3 DENIM

To address this challenge, we present DENIM (Algorithm 3), which itself nests its updates to mimic the nesting of the performance measure. DENIM follows a similar principle to DUPLE and is based on the NEMSIS algorithm of Kar et al. (2016). However, NEMSIS faces the same drawbacks as the SPADE algorithm and is unsuitable for training deep models. Due to the more complex nature of the performance measure, DENIM works with a slightly different cost-weighted objective function at each time step:

$$\begin{aligned} h(\mathbf {w}; S, {\varvec{\alpha }}, \varvec{\beta }, \varvec{\gamma }) = (\gamma _{1}\alpha _{1} + \gamma _{2}\beta _{1})\cdot \hat{P}_S(\mathbf {w}) + (\gamma _{1}\alpha _{2} + \gamma _{2}\beta _{2})\cdot \hat{N}_S(\mathbf {w}) \end{aligned}$$

However, as in DUPLE, all these coefficients can be calculated using cheap-to-update proxies. Note also that DENIM performs inner and outer dual updates that are themselves nested. DENIM enjoys convergence results similar to those of DUPLE, which we omit for lack of space.
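
We do not reproduce the full updates here, but a plausible reading of the nesting, following the same Danskin-style reasoning as in DUPLE rather than a verbatim transcription of Algorithm 3, is to apply one dual update per level: with running proxies \((\bar{P},\bar{N})\) for the inner rewards and \((\bar{\zeta }_1,\bar{\zeta }_2)\) for the outer concave values, one picks supergradients

$$\begin{aligned} (\alpha _1,\alpha _2) \in \partial \zeta _1(\bar{P},\bar{N}), \quad (\beta _1,\beta _2) \in \partial \zeta _2(\bar{P},\bar{N}), \quad (\gamma _1,\gamma _2) \in \partial \varPsi (\bar{\zeta }_1,\bar{\zeta }_2), \end{aligned}$$

and then takes a single back-propagation step on \(h(\mathbf {w}; S_t, \varvec{\alpha }, \varvec{\beta }, \varvec{\gamma })\), exactly as DUPLE does with \(g\).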

6 Experimental results

Table 2 Statistics of datasets. Datasets are all binary and arranged in increasing order of the prevalence of the rare class. The most imbalanced dataset is KDD08 with under 1% positives and over 99% negatives. The most balanced is Cod-RNA with 33% positives and 67% negatives

We performed an extensive evaluation of DAME, DUPLE and DENIM on benchmark and real-life challenge datasets and found them to outperform both traditional techniques for training neural networks and the more nuanced training techniques popularly used to handle label imbalance and non-decomposable measures, such as cost-weighted classification, plug-in methods and loss-augmented inference. We note that loss-augmented inference has been used in the past (Song et al. 2016) for ranking measures. We show that our methods outperform loss-augmented inference as well.

We developed two variants of the DUPLE and DENIM algorithms in a manner similar to Kar et al. (2016). DUPLE-NS refers to a variant of DUPLE that uses count-based rewards instead of sigmoidal rewards to set the dual variables. Recall that DUPLE collects the sigmoidal rewards obtained on positive and negative points and uses them to set the dual variables; DUPLE-NS instead checks whether those data points were classified correctly or not, i.e., it collects count rewards. However, while computing gradients, we still use the sigmoidal reward function. DENIM-NS was constructed similarly.

Datasets We use the datasets listed in Table 2. Twitter refers to the dataset released as part of the SEMEVAL 2016 sentiment detection challenge (Esuli 2016).

Competing methods We implemented and adapted several benchmarks in an attempt to critically assess the performance of our methods.

  (1) ANN 0-1 refers to a benchmark that trains a neural network using the cross-entropy loss function in an attempt to minimize the misclassification rate.

  (2) ANN-p refers to a benchmark that trains a neural network using a cost-weighted cross-entropy loss function in an attempt to minimize the cost-weighted misclassification loss. Each class is given a weight inversely proportional to its prevalence so as to give more emphasis to the rare class.

  (3) STRUCT-ANN refers to an implementation of the loss-augmented inference and cutting-plane techniques that have been popularly used in the past to optimize arbitrary non-decomposable performance measures. Please refer to Appendix C for the implementation details of this method.

  (4) ANN-PG refers to an implementation of the plug-in classification technique for the F-measure as suggested in Koyejo et al. (2014).

Network architecture All neural networks compared in a given experiment were given the same architecture. For experiments on concave and nested concave performance measures, all methods were provided with a network with two hidden layers of size 150 and 100 respectively. The ReLU activation function was used for all internal nodes and the output node used a sigmoidal activation. For experiments on pseudo-linear performance measures, the network contained three hidden layers of size 60, 60 and 80, with sigmoidal activations on all nodes. The DAME method (Algorithm 1) used \(d_\text {int} = 80\).

Batch size The batch size was fixed at 256 for all algorithms except STRUCT-ANN, which was given a much larger batch size of 6000 since we found that method to do better with larger batches. However, this greatly slowed down gradient computations, and each iteration of STRUCT-ANN took much longer than each iteration of DAME. Increasing batch sizes indefinitely is not an option since memory requirements rise rapidly: the number of temporary variables stored by popular implementations of back-propagation increases linearly with the batch size. If one tries to reduce the memory footprint by making multiple forward passes, the methods usually slow down excessively on GPU architectures. Even so, we will see that our algorithms far outperform STRUCT-ANN. We note that even the past work of Song et al. (2016) had to resort to small batches to avoid memory issues.

Step length We used the Adam optimizer to tune the step lengths for all methods. The learning rate hyper-parameter provided to the Adam method was tuned over the set \(\{0.001,0.01,0.1\}\) on a held-out validation set. Other hyper-parameters of the Adam method were fixed to their default values.

6.1 Experiments with pseudo-linear measures

Figure 1 below shows the performance of DAME when optimizing the F-measure. Naïve training with the misclassification loss (ANN 0-1) yields extremely poor F-measure performance. Moreover, plug-in methods such as those proposed in Koyejo et al. (2014) (ANN-PG) also perform relatively poorly. DAME, on the other hand, rapidly offers very good F-measure scores after looking at a fraction of the total data, and outperforms ANN 0-1 and ANN-p on all the datasets. We can also see that STRUCT-ANN offers consistently poor performance: the method keeps predicting almost all points as negative, incurring a very low F-measure score. This might in principle be blamed on batch size; however, as pointed out before, STRUCT-ANN was in fact provided a much larger batch size than the other methods, and increasing the batch size much further adversely affected performance due to excessive memory usage.

Fig. 1 Experiments on maximizing the F-measure, a pseudo-linear performance measure. The Y-axis represents the value of the performance measure on test data and the X-axis represents training iterations. We paused training after every few iterations and recorded the test accuracy of the model at that snapshot. DAME outperforms all other benchmarks by a large margin. STRUCT-ANN tends to classify most data points as negative and consequently gets a poor F-measure score

Fig. 2 Experiments with MinTPRTNR (Fig. 2a–d) and QMean (Fig. 2e–h), two concave performance measures, and the KL divergence (Fig. 2i–l), a nested concave performance measure. The Y-axis denotes the value of the performance measure on test data and the X-axis denotes training iterations. We paused training after every few iterations and recorded the test accuracies of the models presented by the various methods at that time. DUPLE and ANN-p are the leading methods for MinTPRTNR, although DUPLE offers more stable performance, e.g., on PPI. For the QMean performance measure, DUPLE leads on PPI, KDD08 and A9A and offers more stable performance than ANN-p. DUPLE and ANN-p are the leading methods for the KL divergence, with ANN-p leading on the CovType dataset. In most experiments, STRUCT-ANN continues to classify most data points as negative and consequently gets a poor score on all performance measures. On the PPI dataset with MinTPRTNR, STRUCT-ANN does offer non-trivial predictions; however, its behavior is very erratic

6.2 Experiments on concave measures

Figure 2a–d (optimizing MinTPRTNR) and Fig. 2e–h (optimizing QMean) show that DUPLE offers faster convergence than ANN 0-1, which has a very hard time obtaining a non-trivial MinTPRTNR score. For the experiment on IJCNN1, we ran the methods for a longer time to allow ANN 0-1 and STRUCT-ANN to converge, and we observe that they are highly time-intensive compared to DUPLE.

In the experiments with MinTPRTNR, DUPLE and ANN-p perform comparably, though ANN-p gradually starts overfitting whereas DUPLE retains its performance. With QMean, ANN-p starts overfitting on both PPI and KDD08, while DUPLE keeps performing well. DUPLE-NS is slightly slower than ANN-p on IJCNN1. Our experiments show that DUPLE and DUPLE-NS are more consistent and robust to overfitting across datasets than ANN-p.

DUPLE and its variant DUPLE-NS outperform most competitors in terms of both speed and accuracy, though the two perform comparably with each other. It is also to be noted that DUPLE not only takes fewer iterations than STRUCT-ANN, but each iteration of DUPLE is also much faster than that of STRUCT-ANN, since STRUCT-ANN used a batch size of 6000 whereas DUPLE operated with batches of 256, an order of magnitude smaller. Thus, STRUCT-ANN is even slower in convergence than these figures indicate.

6.3 Experiments with nested performance measures

In Fig. 2i–l, we show the results obtained by DENIM while optimizing the KLD performance measure. DENIM shows rapid convergence to near-perfect quantification scores. The experiments also show that DENIM and DENIM-NS require far fewer iterations than their competitor ANN 0-1 (whenever ANN 0-1 is successful at all). The STRUCT-ANN benchmark does not appear in the graphs for this performance measure since it always obtained a very large KL-divergence value by consistently predicting every data point as negative. ANN-p achieves performance comparable to DENIM on most datasets and is the winner on the CovType dataset. However, we believe that with proper pretraining, DENIM can achieve much better performance on this measure.

6.4 Case study: quantification for sentiment analysis

We report the results of experiments comparing the performance of DENIM on a Twitter sentiment detection challenge problem. The task in this challenge was to correctly ascertain the fraction of tweets exhibiting various sentiments. Performance was measured using the Kullback–Leibler divergence (1). We trained an end-to-end LSTM model using DENIM, as well as an attention-enabled network for the same task. Our models accepted raw text in the standard one-hot encoding format, performed task-specific optimization, and generated task-specific vocabulary embeddings. Our representations were 64-dimensional and learnt jointly with the other network parameters.

Implementation details All our LSTM models used a single hidden layer with 64 hidden nodes, which gave rise to 64-dimensional hidden state representations. For the LSTM model, the final label was obtained by applying a linear model with a sigmoidal activation function. For the attention models (referred to as AM), the decoder hidden states were set to be 64-dimensional as well. The alignment model was set to be a feed-forward model with a softmax layer. Step lengths were tuned using standard implementations of the Adam method. Training was done by adapting the DENIM method.

DENIM is able to obtain near-perfect quantification with both the LSTM \(({\text {KLD}} = 0.007)\) and the AM \(({\text {KLD}} = 0.00002)\) models (see Fig. 3a). In contrast, the classical cross-entropy method with the attention model (AM-CE) is unable to obtain satisfactory performance. DENIM converges to optimal test KLD performance in not only far fewer iterations, but also using far fewer data samples. Note also that the AM models trained with DENIM give KLD losses that are much smaller than those they offer when trained with cross-entropy.

Fig. 3 Results on the Twitter sentiment analysis task. a Convergence to optimal test KLD performance for different RNN models. b Change in quantification performance with distribution drift

Fig. 4 Figuring out where the attention is. Highlighted words got high attention scores. A red (green) highlight indicates that the tweet was tagged with a negative (positive) sentiment (Color figure online)

We also experimented with artificially changing the fraction of positive and negative examples in order to study the performance of our models under distribution drift (see Fig. 3b). The fractions of negatives and positives in the test set were distorted from their original values by re-sampling. As the test distribution priors are distorted more and more, AM-CE (the attention model trained with cross-entropy) performs extremely poorly. DENIM with LSTMs displays some degree of robustness to drift but succumbs at extremely high levels of drift. DENIM with AM models, on the other hand, remains extremely robust to even a high degree of distribution drift, offering near-zero KLD error.

The attention mechanism employed by DENIM allows it to identify critical words in a tweet that signal its polarity. The highlighted words (see Fig. 4) are those to which DENIM assigned an attention score \(\alpha \approx 1\).

7 Conclusion

Our work presents algorithms to train neural networks and other non-linear models to directly optimize non-decomposable performance measures that are popularly used in label-imbalanced training and quantification tasks. Our algorithms offer better performance while using fewer iterations and samples, as compared to traditional cross-entropy based training as well as several other benchmarks usually adopted for optimizing non-decomposable losses, such as cost-weighted classification, plug-in methods, and loss-augmented inference. This leads to several avenues of future work. We observed impressive performance boosts when we employed pretraining with the DAME method, and the same can be tried with the other methods as well. It would be very interesting to investigate extensions of the DUPLE, DENIM and DAME methods to more complex prediction tasks such as multi-class and multi-label classification. Obtaining a better theoretical understanding of how these methods behave when operated with non-linear models (in particular neural networks) would also be of interest.