
1 Introduction

Consider monitoring prediction performance in a stream of data points. That is, we first receive a data point d without its label, we predict the missing label with a score s, and after the prediction we receive the true label \(\ell \). We are interested in monitoring how well s predicts \(\ell \) as the stream evolves over time.

A good example of such a task is a monitoring system for corporate computers that detects abnormal behavior based on event logs. Here the positive label represents an abnormal event that requires closer inspection; such a label can be given, for example, by an expert or triggered automatically. The produced score is used for decision making, and it can be a specific feature, a simple statistic, or the output of a classifier such as logistic regression. It is vital to monitor such a system continuously in order to notice breakdowns early. Possible causes include changes in the underlying distribution or a system failure, for example due to a software update.

A natural choice for monitoring the predictive power of a real-valued score is the area under the ROC curve (AUC) in a sliding window over the stream of events, as proposed by Brzezinski and Stefanowski [5]. Unfortunately, maintaining the exact AUC requires \( \mathcal {O} \mathopen {}\left( k\right) \) time per new event, where k is the size of the window. This may be too expensive if k is large and the rate of events is high.

In this paper we propose a technique for estimating AUC efficiently in a sliding window. Namely, we propose an approximation scheme with a relative error guarantee of \(\epsilon / 2\) and an update time of \( \mathcal {O} \mathopen {}\left( (\log k) / \epsilon \right) \). That is, the scheme provides a trade-off between accuracy and computational complexity.

Our approach is straightforward. Computing AUC exactly requires sorting the data points and summing over all of them (see Eq. 1 for the exact formula). Keeping the points sorted can be done with binary search trees. However, estimating the sum requires additional tricks. We approach the problem by grouping neighboring data points together, that is, treating them as if the classifier had given them the same score.

The key step is to design a grouping such that three properties hold at the same time: (i) the groups are small enough so that the relative error stays small, more specifically at most \(\epsilon / 2\), (ii) the number of groups is small enough, more specifically, it should be in \( \mathcal {O} \mathopen {}\left( (\log k) / \epsilon \right) \), and (iii) the definition should be flexible enough so that we can do quick updates whenever points enter or leave the sliding window.

Roughly speaking, in order to accommodate all three demands, we will maintain the groups with the two following properties: (i) the total number of positive labels up to and including a group is at most \((1 + \epsilon )\) times the total number of positive labels in all the previous groups, and (ii) the total number of positive labels up to and including a group and the next group is larger than \((1 + \epsilon )\) times the total number of positive labels in all the previous groups. The first property yields the approximation guarantee, while the second property guarantees that the number of groups remains small. Moreover, these properties are flexible enough that we can perform the update procedures quickly.

The rest of the paper is organized as follows. We begin by recalling the definition of AUC in Sect. 2. Updating the groups of data points quickly requires several auxiliary structures, which we introduce in Sect. 3. We then proceed to describe the AUC estimation in Sect. 4. The related work is given in Sect. 5. In Sect. 6, we demonstrate that the relative error in practice is much smaller than the guaranteed bound, and study the trade-off between the error and the computational cost. Finally, we conclude the paper with a discussion in Sect. 7.

2 Preliminaries

We start with the definition of AUC, and provide a formula for computing it.

Assume that we are given a set of k pairs \(W = (s_i, \ell _i)_{i = 1}^k\), where \(\ell _i \in \left\{ 0, 1\right\} \) is the true label of the ith instance, and \(s_i\) is the score produced by the classification algorithm. The larger the score \(s_i\), the more we believe that \(\ell _i\) should be 0.

In order to predict a label, we need a threshold \(\sigma \), and predict that \(\ell _i = 0\) if \(s_i \ge \sigma \), and \(\ell _i = 1\) otherwise. The ROC curve is obtained by varying \(\sigma \) and plotting true positive rate as a function of false positive rate. AUC is the area under the ROC curve. To compute AUC, we can use the following formula. Let

$$ p \mathopen {}\left( s\right) = {\left| \left\{ i \mid s_i = s, \ell _i = 1\right\} \right| } \quad \text {and}\quad n \mathopen {}\left( s\right) = {\left| \left\{ i \mid s_i = s, \ell _i = 0\right\} \right| } $$

be the counts of positive and negative labels with a score of s. Define also \( hp \mathopen {}\left( s\right) = \sum _{t < s} p \mathopen {}\left( t\right) \). Then,

$$\begin{aligned} auc = \frac{1}{A}\sum _{s} ( hp \mathopen {}\left( s\right) + \frac{1}{2} p \mathopen {}\left( s\right) ) n \mathopen {}\left( s\right) , \end{aligned}$$
(1)

where \( A = \sum _{s} p \mathopen {}\left( s\right) \times \sum _{t} n \mathopen {}\left( t\right) \), the total number of positive labels times the total number of negative labels, is the normalization factor. Equation 1 can be computed in \( \mathcal {O} \mathopen {}\left( k \log k\right) \) time by first sorting W, then computing \( hp \), and finally enumerating over the sum of Eq. 1.
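For concreteness, the following minimal Python sketch evaluates Eq. 1 directly on a window of (score, label) pairs. The function name and the dictionary-based bookkeeping are ours, not part of the implementation used later in the experiments.

```python
from collections import Counter

def exact_auc(window):
    """Exact AUC of Eq. 1 for a window of (score, label) pairs with label in {0, 1}."""
    p, n = Counter(), Counter()          # p(s), n(s): label counts per score
    for s, label in window:
        (p if label == 1 else n)[s] += 1
    P, N = sum(p.values()), sum(n.values())
    if P == 0 or N == 0:
        return None                      # AUC is undefined without both classes
    total, hp = 0.0, 0                   # hp = positives with a score strictly below s
    for s in sorted(set(p) | set(n)):    # O(k log k) because of the sort
        total += (hp + 0.5 * p[s]) * n[s]
        hp += p[s]
    return total / (P * N)               # A = P * N is the normalization factor
```

For instance, exact_auc([(0.9, 0), (0.1, 1), (0.5, 1)]) returns 1.0, since both positive labels receive lower scores than the negative one.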

In a streaming setting, W is a sliding window, and our goal is to compute AUC as W slides over a stream of predictions and labels.

3 Supporting Data Structures for Estimating AUC

In this section we introduce supporting data structures that are needed to compute AUC in a streaming setting. Additional structures and the actual logic for computing AUC are given in the next section. We begin by describing the data structures, then follow with introducing the needed query operations, and finally finish with explaining the update procedures.

3.1 Data Structures

Assume that we have a sequence of pairs \(W = (s_i, \ell _i)_{i = 1}^k\), where \(s_i\) is the score produced by the classifier, and \(\ell _i \in \left\{ 0, 1\right\} \) is the true label.

We store W in a red-black tree T sorted by the scores \(s_i\). Let \(v \in T\) be a node in T. We will denote the corresponding score of v by \( s \mathopen {}\left( v\right) \). We store and maintain the following information:

  • Counter \( p \mathopen {}\left( v\right) = {\left| \left\{ i \mid s_i = s \mathopen {}\left( v\right) , \ell _i = 1\right\} \right| }\), number of pairs in W with a score \( s \mathopen {}\left( v\right) \) and a positive label.

  • Counter \( n \mathopen {}\left( v\right) = {\left| \left\{ i \mid s_i = s \mathopen {}\left( v\right) , \ell _i = 0\right\} \right| }\), number of pairs in W with a score \( s \mathopen {}\left( v\right) \) and a negative label.

  • Counter \( accpos \mathopen {}\left( v\right) \), the total sum of \( p \mathopen {}\left( w\right) \), where w ranges over all descendant nodes of v in T, including v itself.

  • Counter \( accneg \mathopen {}\left( v\right) \), the total sum of \( n \mathopen {}\left( w\right) \), where w ranges over all descendant nodes of v in T, including v itself.

For simplicity, we will add two sentinel nodes to T. The first node will have a score of \(-\infty \) and the second node has a score \(\infty \). We will assume that the actual entries will never achieve these values. Both sentinel nodes have 0 positive labels and 0 negative labels.

Note that if the scores \(s_i\) are unique, then we have either \( p \mathopen {}\left( v\right) = 1\), \( n \mathopen {}\left( v\right) = 0\), or \( p \mathopen {}\left( v\right) = 0\), \( n \mathopen {}\left( v\right) = 1\). However, if there are duplicate scores, then we may have any integer combinations.
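As an illustration of this bookkeeping, here is a minimal sketch in which tree nodes are plain Python dictionaries; the helper names make_node and pull are ours, and a real implementation would embed these fields directly in the red-black tree node.

```python
def make_node(score, p=0, n=0):
    """A node of T carrying the counters of Sect. 3.1."""
    return {'score': score, 'p': p, 'n': n,
            'left': None, 'right': None,
            'accpos': p, 'accneg': n}

def pull(v):
    """Recompute accpos/accneg of v from its children; after an insertion or a
    deletion this is called along the O(log k) path back to the root."""
    acc = lambda c, key: c[key] if c else 0
    v['accpos'] = v['p'] + acc(v['left'], 'accpos') + acc(v['right'], 'accpos')
    v['accneg'] = v['n'] + acc(v['left'], 'accneg') + acc(v['right'], 'accneg')
```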

In addition to red-black trees, we need to maintain several linked lists, for which we will now introduce the notation. Assume that we are given a subset U of nodes in T. We would like to maintain U in a linked list L, sorted by the score. For that we will need two pointers for each node \(u \in U\), namely, \( next \mathopen {}\left( u; L\right) \) indicating the next node in L, and \( prev \mathopen {}\left( u; L\right) \) indicating the previous node in L. Let \(u \in U\) and assume that \(v = next \mathopen {}\left( u; L\right) \) exists. Let

$$ B = \left\{ w \in T \mid s \mathopen {}\left( u\right) \le s \mathopen {}\left( w\right) < s \mathopen {}\left( v\right) \right\} $$

be the set of nodes in T between u and v. We define

$$ gp \mathopen {}\left( u; L\right) = \sum _{w \in B} p \mathopen {}\left( w\right) \quad \text {and}\quad gn \mathopen {}\left( u; L\right) = \sum _{w \in B} n \mathopen {}\left( w\right) $$

to be the total sums of the labels in the gap B. We will refer to L as a weighted linked list. Note that deleting an element from L and maintaining the gap counters can be done in constant time. We will refer to the deletion algorithm as \(\textsc {Remove} (L, v)\). Moreover, adding a new element, say v, to L after u can also be done in constant time, if we already know the total sums of labels, say p and n, between u and v. We will refer to the insertion algorithm as \(\textsc {Add} (L, u, v, p, n)\).
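A minimal sketch of the two list operations, with list nodes represented as dictionaries holding prev/next pointers and the gap counters gp and gn; the function names are ours and the bookkeeping follows the gap definition above.

```python
def wl_remove(u):
    """Remove(L, v): unlink u in O(1); its gap is merged into the predecessor's gap."""
    w = u['prev']
    w['gp'] += u['gp']               # the predecessor's gap now also covers u's gap
    w['gn'] += u['gn']
    w['next'] = u['next']
    if u['next'] is not None:
        u['next']['prev'] = w

def wl_add(u, v, p, n):
    """Add(L, u, v, p, n): insert v after u in O(1), where p and n are the label
    sums of the nodes of T between u and v; u keeps that part of its old gap and
    v receives the remainder."""
    v['gp'], v['gn'] = u['gp'] - p, u['gn'] - n
    u['gp'], u['gn'] = p, n
    v['prev'], v['next'] = u, u['next']
    if u['next'] is not None:
        u['next']['prev'] = v
    u['next'] = v
```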

We say that the node \(v \in T\) is positive, if \( p \mathopen {}\left( v\right) > 0\). Similarly, we say that the node v is negative, if \( n \mathopen {}\left( v\right) > 0\). Note that v can be both negative and positive.

We maintain all positive nodes in a weighted linked list, which we will refer to as P. Finally, we also store all positive nodes in their own dedicated red-black tree, denoted by \( TP \). For simplicity, we also store the sentinel nodes of T in P and \( TP \) as the first and the last nodes.

3.2 Query Procedures

The first query that we need is \(\textsc {MaxPos} (s)\), returning the positive node v with the largest score such that \( s \mathopen {}\left( v\right) \le s\). This can be done in \( \mathcal {O} \mathopen {}\left( \log k\right) \) time using \( TP \), where k is the number of elements in the window.

Maintaining \( accpos \mathopen {}\left( v\right) \) and \( accneg \mathopen {}\left( v\right) \) allows us to query cumulative sums of counts. Specifically, given a score s, we are interested in

$$\begin{aligned} hp \mathopen {}\left( s\right) = \sum _{v \in T \mid s \mathopen {}\left( v\right)< s} p \mathopen {}\left( v\right) \quad \text {and}\quad hn \mathopen {}\left( s\right) = \sum _{v \in T \mid s \mathopen {}\left( v\right) < s} n \mathopen {}\left( v\right) . \end{aligned}$$
(2)

We can compute both of these sums with \(\textsc {HeadStats} (s)\), given in Algorithm 1.

[Algorithm 1: HeadStats]

The algorithm assumes that there is a node in T containing s, and proceeds to find it; during the search, whenever we descend into the right branch, we add the accumulated sums of the left branch and the counts of the current node. We omit the trivial proof of correctness. Since the tree is balanced, the running time of \(\textsc {HeadStats} (s)\) is \( \mathcal {O} \mathopen {}\left( \log k\right) \), where k is the number of entries in the window.
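The following Python sketch mirrors this search on the dictionary-based nodes used above (keys score, p, n, left, right, accpos, accneg); the function name is ours.

```python
def head_stats(root, s):
    """HeadStats(s): return hp(s) and hn(s) of Eq. 2, the label counts of all
    entries with a score strictly below s."""
    acc = lambda v, key: v[key] if v else 0
    hp = hn = 0
    v = root
    while v is not None:
        if s > v['score']:
            # v and its entire left subtree lie strictly below s
            hp += acc(v['left'], 'accpos') + v['p']
            hn += acc(v['left'], 'accneg') + v['n']
            v = v['right']
        elif s < v['score']:
            v = v['left']
        else:
            # the node containing s: only its left subtree lies strictly below s
            hp += acc(v['left'], 'accpos')
            hn += acc(v['left'], 'accneg')
            break
    return hp, hn
```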

3.3 Update Procedures

We now continue to the maintenance procedures as we slide the window. This comes down to two procedures: (i) removing an entry from the window and (ii) adding an entry to the window.

We will first describe removing an entry with a positive label and a score s. First we will find the node, say v, with the score s, and reduce the counter \( p \mathopen {}\left( v\right) \) by 1. We will need to update the \( accpos \) counters. However, we only need to do it for the ancestors of v, and there are only \( \mathcal {O} \mathopen {}\left( \log k\right) \) of them, where k is the number of entries in the window, since T is balanced. We also reduce \( gp \mathopen {}\left( v; P\right) \) by 1. In the process, v may become non-positive, and we need to delete it from \( TP \) as well as from P.

Finally, if \( p \mathopen {}\left( v\right) = n \mathopen {}\left( v\right) = 0\), we need to delete the node from T. This may result in rebalancing of the tree, and during the balancing we need to make sure that the counters \( accpos \) and \( accneg \) are properly updated. Luckily, the red-black tree balancing is based on left and right rotations. During these rotations it is easy to maintain the counters without additional costs.
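For concreteness, a left rotation with this counter maintenance could look as follows (dictionary-based nodes as in the earlier sketches; the right rotation is symmetric). This is only a sketch of the idea; the actual rebalancing code of the red-black tree is not reproduced here.

```python
def rotate_left(x):
    """Left rotation around x that keeps accpos/accneg consistent.
    Returns the new root y of this subtree."""
    acc = lambda v, key: v[key] if v else 0
    def pull(v):
        v['accpos'] = v['p'] + acc(v['left'], 'accpos') + acc(v['right'], 'accpos')
        v['accneg'] = v['n'] + acc(v['left'], 'accneg') + acc(v['right'], 'accneg')
    y = x['right']
    x['right'], y['left'] = y['left'], x
    pull(x)   # x's subtree changed and x is now below y, so update x first
    pull(y)   # y's subtree now contains x
    return y
```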

We will refer to this procedure as \(\textsc {RemoveTreePos} (s)\) and the pseudo-code is given in Algorithm 2. \(\textsc {RemoveTreePos} (s)\) runs in \( \mathcal {O} \mathopen {}\left( \log k\right) \) time.

[Algorithm 2: RemoveTreePos]

Deleting an entry with a negative label and a score s is simpler. First, we find the node, say v, with the score s, and reduce the counter \( n \mathopen {}\left( v\right) \) by 1. If needed, we delete v from T. Finally, we use \(\textsc {MaxPos} (s)\) to find u, the largest positive node with \( s \mathopen {}\left( u\right) \le s\), and reduce \( gn \mathopen {}\left( u; P\right) \) by 1. The procedure, referred to as RemoveTreeNeg, runs in \( \mathcal {O} \mathopen {}\left( \log k\right) \) time.

Next, we will describe the addition of a positive entry with a score s. First, we will add the entry s to T, possibly creating a new node in the process. Let v be the node in T with the score s.

If v is a new node, then we need to add it to the weighted linked list P. First, we find the node, say \(w = \textsc {MaxPos} (s)\), after which v is to be added. Adding v splits the gap of w, so we need the total count of negative labels of nodes between w and v, including w. This count is equal to \( hn \mathopen {}\left( v\right) - hn \mathopen {}\left( w\right) \), and both quantities can be obtained using \(\textsc {HeadStats} {}\) in \( \mathcal {O} \mathopen {}\left( \log k\right) \) time; Add then sets the gap counters of w and v accordingly.

We will refer to this procedure as \(\textsc {AddTreePos} (s)\), and the pseudo-code is given in Algorithm 3. \(\textsc {AddTreePos} (s)\) runs in \( \mathcal {O} \mathopen {}\left( \log k\right) \) time.

[Algorithm 3: AddTreePos]

Adding an entry with a negative label and a score s is simpler. First, we add the entry s to T, possibly creating a new node in the process. Let v be the node in T with the score s. Then, we use \(\textsc {MaxPos} (s)\) to find u, the largest positive node with \( s \mathopen {}\left( u\right) \le s\), and increase \( gn \mathopen {}\left( u; P\right) \) by 1. The procedure, referred to as AddTreeNeg, runs in \( \mathcal {O} \mathopen {}\left( \log k\right) \) time.

4 Estimating AUC Efficiently

In order to approximate AUC, we will use Eq. 1 as a basis. However, instead of enumerating over every node we will enumerate only over some selected nodes. The key is how to select the nodes such that we will obtain the approximation guarantee while keeping the number of nodes small.

We will maintain a weighted linked list \( C \). Given \(\alpha > 1\), we say that \( C \) is \(\alpha \)-compressed, if for every two consecutive nodes in \( C \), say v and w, it holds that

$$\begin{aligned} hp \mathopen {}\left( w\right) \le \alpha ( hp \mathopen {}\left( v\right) + p \mathopen {}\left( v\right) ), \end{aligned}$$
(3)

and if \(u = next \mathopen {}\left( w; C \right) \) exists, then

$$\begin{aligned} hp \mathopen {}\left( u\right) > \alpha ( hp \mathopen {}\left( v\right) + p \mathopen {}\left( v\right) ). \end{aligned}$$
(4)

Equation 3 will yield the approximation guarantee, while Eq. 4 will guarantee the running time.
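The two conditions can be checked directly. The following sketch, in our notation, takes the consecutive nodes of \( C \) as (hp, p) pairs sorted by score and verifies Eqs. 3 and 4.

```python
def is_compressed(C, alpha):
    """Return True if the list of (hp, p) pairs satisfies Eqs. 3 and 4."""
    for i in range(len(C) - 1):
        hp_v, p_v = C[i]
        hp_w, _ = C[i + 1]
        if hp_w > alpha * (hp_v + p_v):          # Eq. 3 fails for v and its successor w
            return False
        if i + 2 < len(C):
            hp_u, _ = C[i + 2]
            if hp_u <= alpha * (hp_v + p_v):     # Eq. 4 fails: w could be removed
                return False
    return True
```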

4.1 Computing Approximate AUC

Our next step is to show how we can approximate AUC using a compressed list L in \( \mathcal {O} \mathopen {}\left( {\left| L\right| }\right) \) time. The idea is as follows. Let B be the set of nodes strictly between two consecutive nodes v and w in L. Normally, we would have to go over each individual node in B when computing AUC. Instead, we group B into a single node. We use the total number of positive labels in B, that is, \( gp \mathopen {}\left( v; L\right) - p \mathopen {}\left( v\right) \), as the number of positive labels of this node. Similarly, we use \( gn \mathopen {}\left( v; L\right) - n \mathopen {}\left( v\right) \) for the negative labels. The pseudo-code for the algorithm is given in Algorithm 4.

[Algorithm 4: ApproxAUC]
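The following Python sketch is our reading of Algorithm 4. The compressed list is given as dictionaries holding the node counts p, n and the gap sums gp, gn, the first node is assumed to cover the window from the smallest score (for instance the \(-\infty \) sentinel), and \( hp \) is accumulated on the fly from the gap sums so that the whole pass is linear in the length of the list.

```python
def approx_auc(C):
    """Approximate AUC from a compressed list (sketch of Algorithm 4).
    C: consecutive nodes sorted by score, each a dict with keys
    p, n (own label counts) and gp, gn (gap sums up to the next node)."""
    P = sum(v['gp'] for v in C)                 # total positive labels in the window
    N = sum(v['gn'] for v in C)                 # total negative labels in the window
    if P == 0 or N == 0:
        return None
    total, hp = 0.0, 0
    for v in C:
        total += (hp + 0.5 * v['p']) * v['n']   # exact term for the node v itself
        gap_n = v['gn'] - v['n']                # negatives strictly between v and next(v)
        if gap_n:
            hp_next = hp + v['gp']              # hp of the next node in the list
            total += 0.5 * (hp + v['p'] + hp_next) * gap_n   # surrogate coefficient
        hp += v['gp']                           # advance hp past the whole gap of v
    return total / (P * N)                      # A = P * N
```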

Let us first establish that ApproxAUC produces an accurate estimate.

Proposition 1

Let L be a \((1 + \epsilon )\)-compressed list constructed from the search tree T. Let \(\widehat{ auc } = \textsc {ApproxAUC} (L)\) be the approximate AUC, and let \( auc \) be the correct AUC. Then \({\left| auc - \widehat{ auc }\right| } \le \frac{\epsilon }{2}\, auc \).

Proof

Let A be as defined in ApproxAUC. Let \(v \in T\) be a node, and let u be the node in L with the largest score such that \( s \mathopen {}\left( u\right) < s \mathopen {}\left( v\right) \). Let \(w = next \mathopen {}\left( u; L\right) \) be the next node. Define

$$ c_v = \frac{1}{2}( hp \mathopen {}\left( u\right) + p \mathopen {}\left( u\right) + hp \mathopen {}\left( w\right) ) \quad . $$

Then, ApproxAUC returns

$$\begin{aligned} \widehat{ auc } = \frac{1}{A}\left( \sum _{v \in L} \left( hp \mathopen {}\left( v\right) + \frac{1}{2} p \mathopen {}\left( v\right) \right) n \mathopen {}\left( v\right) + \sum _{v \in T \setminus L} c_v \, n \mathopen {}\left( v\right) \right) . \end{aligned}$$

(5)

We will argue the approximation guarantee by comparing the terms in Eqs. 1 and 5. Let v be a node in L. Then the corresponding term appears in the sums of both equations.

Let \(v \in T \setminus L\), and write \(b = hp \mathopen {}\left( v\right) + \frac{1}{2} p \mathopen {}\left( v\right) \). Let u be the node in L with the largest score such that \( s \mathopen {}\left( u\right) \le s \mathopen {}\left( v\right) \). Let \(w = next \mathopen {}\left( u; L\right) \) be the next node. By definition, we have \( hp \mathopen {}\left( u\right) + p \mathopen {}\left( u\right) \le b \le hp \mathopen {}\left( w\right) \). Since \(c_v\) is the average of the lower bound and the upper bound, we have

$$ {\left| b - c_v\right| } \le \frac{1}{2}( hp \mathopen {}\left( w\right) - hp \mathopen {}\left( u\right) - p \mathopen {}\left( u\right) ) \le \frac{\epsilon }{2} ( hp \mathopen {}\left( u\right) + p \mathopen {}\left( u\right) ) \le \frac{\epsilon b}{2}, $$

where the second inequality follows since L is \((1 + \epsilon )\)-compressed.

We have shown that the approximation holds for the individual terms. Summing over the nodes, weighted by \( n \mathopen {}\left( v\right) / A\), yields \({\left| auc - \widehat{ auc }\right| } \le \frac{\epsilon }{2}\, auc \), completing the proof.    \(\square \)

Two remarks are in order. First, since AUC is always at most 1, Proposition 1 implies that the approximation is also absolute, \({\left| auc - \widehat{ auc }\right| } \le \epsilon / 2\). The relative approximation is more accurate if AUC is small. However, if AUC is close to 1, it may make sense to reverse the approximation guarantee, that is, modify the algorithm such that we have a guarantee of \({\left| auc - \widehat{ auc }\right| } \le \frac{\epsilon }{2} (1 - auc )\). This can be done by flipping the labels, and using \(1 - \textsc {ApproxAUC} ( C )\) as the estimate.

ApproxAUC runs in \( \mathcal {O} \mathopen {}\left( {\left| L\right| }\right) \) time. Next we establish that \({\left| L\right| }\) is small.

Proposition 2

Let L be a \((1 + \epsilon )\)-compressed list. Then \({\left| L\right| } \in \mathcal {O} \mathopen {}\left( \frac{\log k }{ \epsilon }\right) \), where k is the number of entries in the sliding window.

Proof

Write \(L = u_0, \ldots , u_m\). Since L is \((1 + \epsilon )\)-compressed, \( hp \mathopen {}\left( u_2\right) \ge 1\) and \( hp \mathopen {}\left( u_{i + 2}\right) > (1 + \epsilon ) hp \mathopen {}\left( u_i\right) \). Since \( hp \mathopen {}\left( u_m\right) \le k\), we have \((1 + \epsilon )^{\left\lfloor m / 2\right\rfloor - 1} \le k\). Solving for m leads to \(m \in \mathcal {O} \mathopen {}\left( \frac{\log k}{\log {1 + \epsilon }}\right) \subseteq \mathcal {O} \mathopen {}\left( \frac{\log k}{ \epsilon }\right) \).    \(\square \)
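For completeness, the last step of the proof can be spelled out with natural logarithms, using the elementary bound \(\log (1 + \epsilon ) \ge \epsilon / 2\), valid for \(0 < \epsilon \le 1\):

$$ \left\lfloor m / 2\right\rfloor - 1 \le \frac{\log k}{\log (1 + \epsilon )} \le \frac{2 \log k}{\epsilon }, \qquad \text {and so}\qquad m \le \frac{4 \log k}{\epsilon } + 4 \in \mathcal {O} \mathopen {}\left( \frac{\log k}{\epsilon }\right) . $$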

4.2 Updating the Data Structures

Our final step is to describe procedures for maintaining \( C \) as the data window slides. In the previous section, we already described how to update the search trees T and \( TP \) as well as the weighted linked list P. Our next step is to make sure that the weighted linked list \( C \) stays \(\alpha \)-compressed.

We will need two utility routines. The first routine, \(\textsc {AddNext} \), given in Algorithm 5, takes as input a node included in both P and \( C \), and adds to \( C \) the next node in P. This procedure will be used extensively to add extra nodes to \( C \) so that Eq. 3 is satisfied.

[Algorithm 5: AddNext]
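A sketch of how AddNext can run in constant time. Nodes belonging to both lists carry per-list links and gap counters; the field names next_P, gp_P, gn_P and next_C, gp_C, gn_C are ours.

```python
def add_next(u):
    """AddNext(u, C, P): splice v = next(u; P) into C right after u, in O(1)."""
    v = u['next_P']
    # u's gap in P covers exactly the part of u's gap in C that lies before v,
    # so the C-gap of u can be split without any tree queries.
    v['gp_C'] = u['gp_C'] - u['gp_P']
    v['gn_C'] = u['gn_C'] - u['gn_P']
    u['gp_C'], u['gn_C'] = u['gp_P'], u['gn_P']
    v['next_C'] = u['next_C']
    u['next_C'] = v
    return v
```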

Next, we demonstrate how AddNext enforces Eq. 3.

Lemma 1

Assume that a linked list L satisfies Eq. 3 for consecutive positive nodes v and w. Add or remove a single positive entry with a score s, and assume that v and w are still positive. Let u be the next positive node from v in P, and let \(L'\) be the list obtained from L by adding a positive node u. Then Eq. 3 holds for \(L'\) for the nodes v and u as well as for the nodes u and w.

Proof

Let us write \(c_x = hp \mathopen {}\left( x\right) \) before modifying T, and \(c_x' = hp \mathopen {}\left( x\right) \) after the modification. Similarly, write \(b_x = p \mathopen {}\left( x\right) \) before the modification, and \(b_x' = p \mathopen {}\left( x\right) \) after the modification.

Since u is the next positive node of v, we have \(c_u' = c_v' + b_v' \le \alpha (c_v' + b_v')\),

proving the case of v and u.

If \(s \ge s \mathopen {}\left( w\right) \), then \(c_w' = c_w \le \alpha (c_v + b_v) = \alpha c_u = \alpha c_u' \le \alpha (c_u' + b_u')\).

If we are adding s and \(s < s \mathopen {}\left( w\right) \), then

$$ c_w' = c_w + 1 \le \alpha (c_v + b_v + 1) \le \alpha (c_v' + b_v' + 1) = \alpha (c_u' + 1) \le \alpha (c_u' + b_u'), $$

where the last inequality holds since u is a positive node.

If we are removing s and \(s < s \mathopen {}\left( w\right) \), then \(c_v + b_v - 1 \le c_v' + b_v'\), and so

$$ c_w' \le c_w \le \alpha (c_v + b_v) \le \alpha (c_v' + b_v' + 1) = \alpha (c_u' + 1) \le \alpha (c_u' + b_u'). $$

This proves the case for u and w, and completes the proof.   \(\square \)

Note that AddNext runs in constant time; the key step is being able to obtain \( gp \mathopen {}\left( v; P\right) = p \mathopen {}\left( v\right) \) and \( gn \mathopen {}\left( v; P\right) \) in constant time. This is the main reason why we maintain P.

While the first utility algorithm adds new entries to \( C \), our second utility algorithm, Compress, given in Algorithm 6, tries to delete as many entries as possible. It assumes that the input list \( C \) already satisfies Eq. 3, and searches for violations of Eq. 4. Whenever such a violation is found, the algorithm deletes the middle node. Note that deleting this node does not violate Eq. 3. Consequently, upon termination, the resulting linked list is \(\alpha \)-compressed. The computational complexity of \(\textsc {Compress} ( C , \alpha )\) is \( \mathcal {O} \mathopen {}\left( {\left| C \right| }\right) \).

[Algorithm 6: Compress]
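One way to realize Compress as a single left-to-right pass is sketched below; the input is the list as Python dictionaries with keys hp, p (node counts) and gp, gn (gap sums), assumed to satisfy Eq. 3 already, and the gap counters of the surviving nodes are updated in place.

```python
def compress(C, alpha):
    """Compress(C, alpha): enforce Eq. 4 by deleting middle nodes, in O(|C|)."""
    out = []
    for x in C:
        # If Eq. 4 fails for out[-2], out[-1] and x, the middle node out[-1] can
        # be dropped; the deletion condition is exactly Eq. 3 for out[-2] and x.
        while len(out) >= 2 and x['hp'] <= alpha * (out[-2]['hp'] + out[-2]['p']):
            w = out.pop()
            out[-1]['gp'] += w['gp']    # merge w's gap into its predecessor
            out[-1]['gn'] += w['gn']
        out.append(x)
    return out
```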

Next, we describe the update steps. We will start with the easier ones:

Adding negative entry: Given a negative entry with a score s, we first invoke AddTreeNeg. Then we search \(u \in C \) with the largest score such that \( s \mathopen {}\left( u\right) \le s\). Once this entry is found, we increase \( gn \mathopen {}\left( u; C \right) \) by 1.

Removing negative entry: Given a negative entry with a score s, we first invoke RemoveTreeNeg. Then we search \(u \in C \) with the largest score such that \( s \mathopen {}\left( u\right) \le s\). Once this entry is found, we decrease \( gn \mathopen {}\left( u; C \right) \) by 1.

Since the positive labels are not modified, \( C \) remains \(\alpha \)-compressed, so there is no need for modifying \( C \). The running time for both routines is \( \mathcal {O} \mathopen {}\left( \log k + \frac{\log k}{\epsilon }\right) \).

Let us now consider more complex cases:

Adding positive entry: Given a positive entry with a score s, we first invoke AddTreePos. Then we search for \(u \in C \) with the largest score such that \( s \mathopen {}\left( u\right) \le s\). Once this entry is found, we increase \( gp \mathopen {}\left( u; C \right) \) by 1. By doing so, we may have violated Eq. 3 for u. Lemma 1 states that we can correct the problem by adding the next positive node for each violation. However, a closer inspection of the proof shows that there can be only one violation, namely at u. Consequently, we check whether Eq. 3 holds for u, and if it fails, we add the next positive node by invoking \(\textsc {AddNext} (u, C , P)\). Finally, we call \(\textsc {Compress} ( C , \alpha )\) to enforce Eq. 4, ensuring that \( C \) is \(\alpha \)-compressed. The pseudo-code for AddPos is given in Algorithm 7.

[Algorithm 7: AddPos]

Removing positive entry: Assume that we are given a positive entry with a score s. First we search \(u \in C \) with the largest score such that \( s \mathopen {}\left( u\right) \le s\). Once this entry is found, we decrease \( gp \mathopen {}\left( u; C \right) \) by 1. If u is no longer positive, we add the next positive entry to \( C \) and delete u from \( C \). The reason for this is explained later. We proceed by deleting the entry from the search trees with RemoveTreePos.

Next we make sure that Eq. 3 holds for every pair of consecutive nodes v and w. There are two possible cases: (i) v and w were consecutive nodes in \( C \) before the deletion, or (ii) u was deleted from \( C \), and w was the next positive node before the deletion. In the first case, Lemma 1 guarantees that using AddNext enforces Eq. 3. In the second case, note that \( hp \mathopen {}\left( w\right) \) after the deletion is equal to \( hp \mathopen {}\left( u\right) \) before the deletion of u. This implies that since Eq. 3 held for v and u before the deletion, Eq. 3 holds for v and w after the deletion. Finally, we enforce Eq. 4 with Compress. The pseudo-code for RemovePos is given in Algorithm 8.

[Algorithm 8: RemovePos]

In both routines, modifying the search trees is done in \( \mathcal {O} \mathopen {}\left( \log k\right) \) time, while modifying \( C \) is done in \( \mathcal {O} \mathopen {}\left( {\left| C \right| }\right) \subseteq \mathcal {O} \mathopen {}\left( \frac{\log k}{\epsilon }\right) \) time.

5 Related Work

The closest related work is a study by Bouckaert [3], where the author divided the ROC curve area into bins, so that only the counters of the individual bins need to be maintained. However, both the number of bins and the bin boundaries were static, and no direct approximation guarantees were provided.

Using AUC in a streaming setting was proposed by Brzezinski and Stefanowski [5]. The authors use a red-black tree, similar to our tree T, to maintain the order of the data points in a sliding window, but they recompute AUC from scratch every time, leading to an update time of \( \mathcal {O} \mathopen {}\left( k + \log k\right) \). In fact, our approach is essentially equivalent to theirs if we set \(\epsilon = 0\).

Note that using AUC is useful if we do not have a threshold to binarize the score. If we do have such a threshold, then we can easily maintain a confusion matrix, which allows us to compute many metrics, such as accuracy, recall, F1-measure [8, 9], and the Kappa-statistic [2, 13]. However, determining such a threshold may be extremely difficult since it depends on the misclassification costs, and selecting such costs may come down to a(n educated) guess.

We based our AUC calculation on a sliding window, that is, we abruptly forget the data points after a certain period of time. The other option is to forget the data points gradually, for example using an exponential decay (see the survey by Gama et al. [10] for such examples). There is currently no methodology for efficiently estimating AUC under exponential decay, and this is a promising line of future work.

In a related line of work, training a classifier by optimizing AUC in a static setting has been proposed by Ataman et al. [1], Brefeld and Scheffer [4], Ferri et al. [7], and Herschtal and Raskutti [12]. Here, AUC is used as an optimization criterion, and needs to be recomputed from scratch in \( \mathcal {O} \mathopen {}\left( {\left| D\right| } \log {\left| D\right| }\right) \) time. Naturally, this may be too expensive for large databases. Calders and Jaroszewicz [6] estimated AUC with a continuous function. This allowed viewing AUC as a smooth function, and optimizing the parameters of the underlying classifier efficiently using gradient descent techniques. While the underlying problem is the same as ours, namely that computing AUC from scratch is expensive, the maintenance procedures make the problems orthogonal: in our setting we are required to do updates when a single data point enters or leaves the window, whereas there AUC needs to be recomputed because the scores (and the order) of all existing data points have changed. However, it may be possible and fruitful to use similar tricks in order to speed up the AUC calculation when optimizing classifiers. We leave this as a future line of work.

Hand [11] proposed a fascinating alternative to AUC. Namely, the author views AUC as the optimal classification loss averaged (with weights) over the misclassification cost ratio. He argues that AUC evaluates incoherently, since the cost ratio weights depend on the ROC curve itself, and proposes a coherent alternative. The computation of the proposed metric, though more complex, shares some similarity with AUC, and it may be possible to use techniques similar to those in this paper to approximate this measure efficiently in a stream.

6 Experimental Evaluation

In this section we present our experimental evaluation. We have two goals: to demonstrate the relative error in practice as a function of the guaranteed error, and to demonstrate the trade-off between the computational cost and the error.

We implemented the computation of AUC in C++, and conducted the experiments on a MacBook Air (1.6 GHz Intel Core i5, 8 GB memory). As a classifier we used the scikit-learn implementation of logistic regression in Python. Computing AUC was done as a separate job from training the classifier and scoring new data points; the reported running times measure only the computation of AUC over the whole test data.

We used three UCI datasets for our experiments, see Table 1: (i) Hepmass, a dataset containing features from simulated particle collisions, split into training and test datasets. We used the Hepmass-1000 variant. Due to the memory restrictions of Python, we used only a sample of \(500\,000\) data points from the training data; we used the whole test dataset. (ii) Miniboone, a dataset used to distinguish electron neutrinos from muon neutrinos. Since the original data has the data points ordered by label, we permuted the dataset and split it into training and test data. (iii) Tvads, a dataset containing features for identifying commercials in TV news channels. We used the BBC and CNN channels as training data, and the remaining channels as test data.

Table 1. Basic characteristics of the benchmark datasets.

Actual Error vs. Guarantee: Proposition 1 states that the relative error cannot be more than \(\epsilon / 2\). First, we test the actual relative error, that is, \({\left| auc - \widehat{ auc }\right| } / auc \), as a function of \(\epsilon \). Here we set the sliding window size to 1000.

The top row of Fig. 1 shows the relative error averaged over all sliding windows, and the bottom row of Fig. 1 shows the relative error maximized over all sliding windows. From the results we see that both the maximum and the average error are smaller than the guarantee. In particular, the average error is typically several orders of magnitude smaller than the theoretical guarantee. As expected, both errors tend to increase as \(\epsilon \) increases.

Fig. 1. Actual relative error as a function of \(\epsilon \). Top row: average error, bottom row: maximum error. Proposition 1 states that the error cannot be larger than \(\epsilon / 2\).

Computational Cost vs. Error: Next, we test the trade-off between the computational cost and the relative error. The top row of Fig. 2 shows the running time as a function of the average error, while the bottom row of Fig. 2 shows the size of the \((1 + \epsilon )\)-compressed list as a function of the average error. Here, we used a window size of 1000.

From the results, we see the trade-off between the error and the running time: as the error increases, the running time drops. This is mainly due to there being fewer elements in the compressed list, as demonstrated in the bottom row. The running time stabilizes for larger errors; this is due to the operations that do not depend on \(\epsilon \), such as maintaining the binary tree T.

Computational Cost vs. Window Size: Computing the exact AUC requires \( \mathcal {O} \mathopen {}\left( k\right) \) time per update, while estimating AUC requires \( \mathcal {O} \mathopen {}\left( (\log k) / \epsilon \right) \). Consequently, the speed-up should increase as the size of the sliding window increases. We demonstrate this effect in Fig. 3 using the Miniboone dataset. We see that the speed-up increases as a function of the window size: computing estimates using \(\epsilon = 0.1\) is 17 times faster for a window size of \(10\,000\).

Fig. 2. Top row: running time as a function of the average relative error. Bottom row: size of the compressed list \({\left| C\right| }\) as a function of the average relative error.

Fig. 3. The speed-up of estimating AUC with \(\epsilon = 0.1\) over computing AUC exactly, as a function of the sliding window size. The dataset is Miniboone.

7 Concluding Remarks

In this paper we introduced an approximation scheme that allows us to maintain an estimate of AUC in a sliding window with a guaranteed relative error of \(\epsilon / 2\) and \( \mathcal {O} \mathopen {}\left( (\log k) / \epsilon \right) \) update time. The key idea behind the estimator is to group the data points. The grouping has to be done cleverly so that the error stays small, the number of groups stays small, and the list can be updated quickly. We achieve this by maintaining groups where the number of positive labels can increase by a factor of at most \((1 + \epsilon )\) within one group, and must increase by a factor of at least \((1 + \epsilon )\) over two groups. Our experimental evaluation suggests that the average error in practice is much smaller than the guaranteed approximation, and that we can achieve a significant speed-up, especially as the window size grows.

Our algorithm relies on the fact that the data points have no weights; specifically, Lemma 1 relies on the fact that an update may change the counters only by 1. If the data points are weighted, a different approach is required: it is possible to construct a \((1 + \epsilon )\)-compressed list from scratch. The key idea here is a new query, where, given a threshold \(\sigma \), we look for the node v with the largest \( hp \mathopen {}\left( v\right) \) such that \( hp \mathopen {}\left( v\right) \le \sigma \). This query can be done using the same trick as in HeadStats, and it requires \( \mathcal {O} \mathopen {}\left( \log k\right) \) time. The list can then be constructed by calling this query with exponentially increasing thresholds \( \mathcal {O} \mathopen {}\left( (\log k) / \epsilon \right) \) times. This leads to a running time of \( \mathcal {O} \mathopen {}\left( (\log ^2 k) / \epsilon \right) \). An interesting direction for future work is to improve this complexity to, say, \( \mathcal {O} \mathopen {}\left( (\log k) / \epsilon \right) \).