
1 Introduction

Active learning has been widely used in computer vision to address the sample imbalance problem, in which the available labeled data are far fewer than the unlabeled data [18, 35]. It is an iterative loop that finds the most valuable samples for an oracle to label, gradually improving the model's generalization ability until a convergence condition is satisfied [39]. Two motivations underlie the design of a practical active learning algorithm, namely uncertainty and representativeness [8, 15]. Uncertainty improves the model's generalization ability, while representativeness prevents the model from being biased.

Among all active learning tasks, multi-label classification, which aims to assign multiple labels to each object, may be the most difficult and costly one [10, 17, 42]. Active learning for multi-label learning has therefore become increasingly important for reducing the cost of various multi-label tasks [6, 7, 38, 41]. State-of-the-art multi-label active learning methods can be classified into three categories according to the query function used to select valuable samples. The first category relies on the labeled data to design a query function based on uncertainty [25, 28]. In such methods, the query function ignores the latent structural information in the large-scale unlabeled data, leading to serious sample bias and undesirable performance. To eliminate this problem, a second category, which depends on representativeness, has been developed [26]. These approaches elaborately consider the structural information of the unlabeled data but discard the discriminative (uncertain) information, so a large number of samples is required before an optimal boundary is found. Since using either the uncertainty criterion or the representativeness criterion alone may not achieve desirable performance, a third category that combines both criteria arises naturally and can effectively alleviate these problems [8, 15]. However, the approaches in the third category are either heuristic in designing the specific query criterion or ad hoc in measuring the uncertainty and representativeness of the samples, and the uncertainty still relies only on the limited labeled data. Most importantly, previous works ignore the outlier labels that exist in multi-label classification when designing a query model for active learning.

However, outlier labels have a significant influence on the measurement of uncertainty and representativeness in multi-label learning. In the following, we discuss the outlier label and its negative influence on both measurements in detail.

Fig. 1. The influence of an outlier label on the learning process.

Figure 1 shows a simple example of the influence of outlier labels. As the input, we annotate the image with three labels, namely tree, elephant and lion. The image feature is therefore composed of three parts: the feature of the tree, the feature of the elephant and the feature of the lion. Intuitively, the tree contributes far more to the image feature than the elephant or the lion, and the lion contributes the least. If we use this image with its three labels to learn a lion/non-lion binary classification model, the model would actually depend on the tree and elephant features rather than the lion features, and would thus be biased for classifying lions versus non-lions. Given a test image in which a lion covers most of the image, the trained model would not recognize it. If we use such a model to measure uncertainty in active learning, it may mismeasure images carrying the lion label. We call the lion label in the input image an outlier label.

Fig. 2. Illustration of the two properties of outlier labels. Left: the outlier label (Lion) is relevant to the image; right: the outlier label (Lion) is much less relevant to the image than the most relevant label (Tree) is.

We now give a formal definition of the outlier label. Denote by x the selected instance and by \(y_1\) and \(y_2\) two relevant labels. We define \(y_1\) as an outlier label if it has two properties: first, \(y_1\) is a relevant label of the instance x; second, \(y_1\) is much less relevant to x than \(y_2\) is. If the trained model could determine whether \(y_1\) or \(y_2\) is the outlier label of x, this would be very useful for building a promising query model. Figure 2 illustrates the two properties. The definition of the outlier label is consistent with the fact that, given an image, the relevance of some labels is apparent and can be recognized by the oracle at first glance, whereas the relevance of other labels is veiled and may require much effort from the oracle. The definition is also consistent with the query types proposed in [14]. For two multi-label images with the same label set but different outlier labels, the features of the two images may differ greatly, so it is very hard to diagnose the similarity between two instances with outlier labels. In Fig. 3, we provide a simple example of this problem, showing the similarity between SIFT features under a Gaussian kernel [1, 22, 30, 32] and the label similarity based on the maximum correntropy criterion (MCC). Intuitively, the similarity between image 1 and image 2 should be larger than that between image 2 and image 3, since the labels of image 1 and image 2 are exactly the same. However, the opposite holds when the similarity is measured with their SIFT features. This is because the outlier label of image 1 is lion, whereas image 2 has two outlier labels, tree trunk and lion, and the different outlier labels greatly increase the difference between the features of the two images. In summary, outlier labels deteriorate the measurement of both uncertainty and representativeness.

Fig. 3. The influence of outlier labels on the measurement of similarity

To address the above problems, we propose in this paper the robust multi-label active learning (RMLAL) algorithm, which effectively combines uncertainty and representativeness based on the MCC [40].

Regarding robustness, correntropy has proved promising in information theoretic learning (ITL) and can efficiently handle large outliers [40]. In traditional active learning algorithms, the mean square error (MSE) cannot easily control the large errors caused by outliers [12, 19, 23, 27, 29, 36]. We therefore replace the MSE criterion with the MCC in the proposed formulation with a minimum margin model. In this way, the proposed method is able to eliminate outlier samples, making the query function more robust.

Regarding discriminative labeling, we use the MCC to measure the loss between the true label and the predicted label. The MCC amplifies the most discriminative information and suppresses useless or unexpected information. Hence, with the MCC, if a label is not an outlier label, it plays an important role in constructing the query model; otherwise, the model decreases the influence of the outlier label when measuring uncertainty. The effects of discriminative labels are thus strengthened and those of outlier labels suppressed, achieving discriminative labeling.

For representativeness, we mix the predicted labels of the unlabeled data with the MCC as the representativeness measure. As shown in Fig. 3, although two samples may have the same labels, their outlier labels can differ, making their features quite different. If we use only the corresponding features to measure the similarity, we obtain a wrong diagnosis. Hence, we propose to combine label similarity and sample similarity to define the consistency between labels and samples. With the different space measurements compensating for each other [33, 34, 37], the combination makes the measurement of representativeness more general. To decrease the computational complexity of the proposed method, the half-quadratic optimization technique is adopted to optimize the MCC. The contributions of our work can be summarized as follows:

  • To the best of our knowledge, this is the first work to focus on outlier labels in multi-label active learning. We build a robust and effective query model for multi-label active learning.

  • The predicted labels of the unlabeled data and the labels of the labeled data are combined with the MCC to merge uncertain and representative information, yielding an approach that makes the uncertain information more precise.

  • The proposed representativeness measurement considers label similarity via the MCC. It effectively handles outlier labels, makes the similarity more accurate for multi-label data, and provides a way to merge representativeness into uncertainty.

The rest of the paper is organized as follows: Sect. 2 briefly introduces related works. Section 3 defines and discusses a new objective for robust multi-label active learning and proposes an algorithm based on half-quadratic optimization. Section 4 evaluates our method on several benchmark multi-label data sets. Finally, we conclude the paper in Sect. 5.

2 Related Works

Since multi-label problems are universal in the real world, they have drawn great interest in many fields. For a multi-label object, an oracle needs to consider all the relevant labels, so labeling in multi-label tasks is more costly than in single-label learning; nevertheless, research on active learning for multi-label data remains limited.

In multi-label learning, one instance corresponds to more than one label. A direct way to solve a multi-label problem is to convert it into several binary problems [21, 31]. In these approaches, the uncertainty is measured for each label, and a combining strategy is then adopted to measure the uncertainty of an instance. [21] trained probabilistic binary logistic regression classifiers at different levels and combined them with a level-switching strategy for adaptive selection. [31] converted the SVM margin into a probability score to select the instance to query. Recently, [26] selected valuable instances by minimizing the expected error reduction. Other works combine informativeness and representativeness for a better query [8, 20]. [20] combined the label cardinality inconsistency and the separation margin with a tradeoff parameter. [8] incorporated the data distribution into the selection process by using an appropriate dissimilarity between pairs of samples with sparse modeling representative selection. All the above algorithms query all the labels of the selected instances. Other approaches query label-instance pairs, i.e., a relevant label together with an instance, at each iteration [14, 16]. [14] queried instances with relevant labels based on the query types. [16] selected label-instance pairs based on a label ranking model. In these approaches, some important labels may be lost. In this study, considering that the combination of informativeness and representativeness is very effective in active learning, we adopt this strategy.

Whether selecting instances by all their labels or by label-instance pairs, most active learning algorithms select uncertain instances based only on very limited samples and ignore the label information. For example, when all labels are queried for one instance, if the instance contains too many outlier labels in the label ranking, it may decrease the performance of the task. Moreover, when only relevant labels are queried for one instance, some relevant labels may be lost because of the limited query budget. To address these problems, we use the predicted labels of the unlabeled data to enhance the uncertainty measurement and adopt the MCC to consider as many highly relevant labels as possible while excluding the outlier labels. To the best of our knowledge, this is the first time the MCC has been adopted in multi-label active learning with data labels for querying.

3 Methodology

Suppose we are given a multi-label data set \(D = \{x_1, x_2,\ldots ,x_n\}\) with n samples and C possible labels for each sample. Initially, l samples in D are labeled. Without loss of generality, we denote the l labeled samples as the set \(L = \{(x_1,y_1), (x_2,y_2),\ldots , (x_l,y_l)\}\), where \(y_i = (y_{i1},y_{i2},\ldots , y_{iC})\) is the label set of sample \(x_i\), with \(y_{ik}\in \{-1,1\}\); the remaining \(u = n - l\) unlabeled samples are denoted as the set \(U = \{x_{l+1}, x_{l+2},\ldots , x_{l+u}\}\), which is the candidate set for active learning. Moreover, we denote by \(x_q\) the query sample in the active learning process; in each iteration we select \(x_q \in U\). Bold symbols denote matrices or vectors. These symbols are used throughout the following discussion.
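As a concrete illustration of this setup, the short sketch below builds a labeled set L and an unlabeled pool U; the array shapes, the random labels and the split size are illustrative assumptions of the sketch, not part of the method.

```python
import numpy as np

# Illustrative problem setup: n samples, d-dimensional features, C labels in {-1, +1}.
rng = np.random.default_rng(0)
n, d, C = 1000, 64, 5
X = rng.standard_normal((n, d))            # feature matrix for D
Y = rng.choice([-1, 1], size=(n, C))       # label matrix, y_ik in {-1, +1}

l = 40                                     # number of initially labeled samples
perm = rng.permutation(n)
L_idx, U_idx = perm[:l], perm[l:]          # indices of L and of the candidate pool U

X_L, Y_L = X[L_idx], Y[L_idx]              # labeled pairs (x_i, y_i) in L
X_U = X[U_idx]                             # unlabeled candidate set U (u = n - l samples)
```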

3.1 Maximum Correntropy Criterion

In multi-label classification tasks, outlier labels pose a great challenge to training a precise classifier, mainly because of the unpredictable errors (bias) they cause. In active learning in particular, the limited labeled samples containing outliers easily lead to a large bias. Since the supervised information in active learning is limited, it is hard to avoid the influence of outlier labels when building the supervised model. This directly biases the uncertainty information and, in turn, makes the queried instances undesirable or even degrades performance.

Recently, the concept of correntropy was first proposed in ITL and has drawn much attention in the signal processing and machine learning communities for robust analysis, since it can effectively handle outliers [13]. Correntropy is a similarity measure between two arbitrary random variables a and b [13], defined by

$$\begin{aligned} {\hat{V}_\sigma }\left( {a,b} \right) = E\left[ {{K_\sigma }\left( {a,b} \right) } \right] \end{aligned}$$
(1)

where \(K_\sigma (\cdot )\) is the kernel function and \(E[\cdot ]\) is the expectation operator. The definition of correntropy is based on the kernel method, so it enjoys the same advantages as kernel techniques. However, unlike conventional kernel based methods, correntropy works independently on pairwise samples and has a strong theoretical foundation. Under this definition, correntropy is symmetric, positive and bounded.

Since the joint probability density function of a and b is unknown in practice, and only finite data \(\{a_i,b_i\}_{i=1}^n\) are available, the sample estimator of correntropy is usually adopted:

$$\begin{aligned} {\hat{V}_{n,\sigma }}\left( {a,b} \right) = \frac{1}{n}\sum \limits _{i = 1}^n {{K_\sigma }\left( {{a_i},{b_i}} \right) } \end{aligned}$$
(2)

where \(K_\sigma (x_1,x_2) = \exp (-\left\| x_1-x_2\right\| ^2/(2\sigma ^2))\). According to [13], maximizing the sample correntropy between a and b gives

$$\begin{aligned} \max \frac{1}{n}\sum \limits _{i = 1}^n {{K_\sigma }\left( {{a_i},{b_i}} \right) } \end{aligned}$$
(3)

The objective function (3) is called the maximum correntropy criterion (MCC) [13]. Compared with the mean square error (MSE), which is a global metric, correntropy is a local metric; that is, the correntropy value is mainly determined by the kernel function along the line a = b [11].
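To make the contrast with MSE concrete, the following sketch (with synthetic numbers that are ours, not taken from the paper) compares the two criteria on a set of residuals containing one large outlier: the squared error is dominated by the outlier, while the outlier's contribution to the correntropy is bounded by the Gaussian kernel and quickly vanishes.

```python
import numpy as np

def gaussian_kernel(e, sigma):
    """Gaussian kernel K_sigma evaluated on the error e = a - b."""
    return np.exp(-e ** 2 / (2.0 * sigma ** 2))

# Synthetic residuals: several small errors plus one large outlier (illustrative values).
errors = np.array([0.1, -0.2, 0.05, 0.15, 8.0])
sigma = 1.0

mse = np.mean(errors ** 2)                              # global metric, dominated by the outlier
correntropy = np.mean(gaussian_kernel(errors, sigma))   # local metric, outlier contributes ~exp(-32)

print(f"MSE         = {mse:.3f}")
print(f"correntropy = {correntropy:.3f}")
```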

3.2 The Proposed Approach

Usually, uncertainty is measured using the labeled data, whereas representativeness is measured using the unlabeled data. In this paper, we propose a novel approach to merge the uncertainty and representativeness of instances in active learning. Minimum margin is the most popular and direct approach to measure uncertainty; it chooses an unlabeled sample according to its prediction uncertainty [15]. Let \(f^{*}\) be the classifier trained on the labeled samples; the sample \(x_q\) to query from the unlabeled data based on the margin can be found as follows:

$$\begin{aligned} {x_q} = \mathop {\arg \min }\limits _{{x_i} \in U} \left| {{f^*}\left( {{x_i}} \right) } \right| \end{aligned}$$
(4)

Generally, with the labeled samples, we can find a classification model \(f^{*}\) for a binary problem in a supervised manner with the following loss function:

$$\begin{aligned} {f^*} = \mathop {\arg \min }\limits _{f \in {\mathcal {H}}} \sum \nolimits _{{x_i} \in L} {\ell \left( {{Y_i},f\left( {{x_i}} \right) } \right) } + \lambda \left\| f \right\| _{\mathcal {H}}^2 \end{aligned}$$
(5)

where \(\mathcal {H}\) is a reproducing kernel Hilbert space endowed with kernel function \(K(\cdot )\), \(\ell (\cdot )\) is the loss function and \(Y_i\) belongs to \(\{1, -1\}\). Following the works of [15], the criterion of the minimum margin can be written as

$$\begin{aligned} {x_q} = \mathop {\arg \min }\limits _{{x_j} \in U} \mathop {\max }\limits _{{Y_j} = \pm 1 } \mathop {\min }\limits _{f \in {\mathcal {H}}} \sum \limits _{{x_i} \in L} {\ell \left( {{Y_i},f\left( {{x_i}} \right) } \right) } + \lambda \left\| f \right\| _{\mathcal {H}}^2 + \ell \left( {{Y_j},f\left( {{x_j}} \right) } \right) \end{aligned}$$
(6)

\(Y_j\) is a pseudo label for the unlabeled sample \(x_j\). Since the problem is binary, \(Y_j\) is 1 or \(-1\); hence we define \(Y_j=-sign(f(x_j))\). In previous works, a quadratic loss (MSE) is adopted as the loss function, but it is not robust in the presence of outliers. To overcome this problem, and considering its properties, we introduce the MCC as the loss function. Unlike MSE, which minimizes the loss, the MCC solves the problem by maximizing the correntropy, presented by

$$\begin{aligned} \mathop {\arg \max }\limits _{{x_q} \in U,{Y_q}=\pm 1,f \in {\mathcal {H}}} \sum \limits _{{x_i} \in L} {\exp \left( { - \frac{{{{\left\| {{Y_i} - f\left( {{x_i}} \right) } \right\| }^2}}}{{2{\sigma ^2}}}} \right) } - \lambda \left\| f \right\| _{\mathcal {H}}^2 + \exp \left( { - \frac{{{{\left\| {{Y_q} - f\left( {{x_q}} \right) } \right\| }^2}}}{{2{\sigma ^2}}}} \right) \end{aligned}$$
(7)

where \(\sigma \) is the kernel width. Following the minimum margin approach, the objective function (7) is equivalent to (4). In our work, we treat multi-label classification as several binary classification problems with label correlation [15]. For simplicity, we assume the labels are independent and learn one classifier for each label independently. We then use the summation as the minimum margin in multi-label learning and use \(f_k\) as the classifier separating the \(k^{th}\) label from the other labels. The multi-label active learning criterion that queries the sample with the minimum margin approach based on the MCC in the worst case is given by

$$\begin{aligned} \begin{aligned} \mathcal {L}(x_q,f,L)=\mathop {\arg \max }\limits _{{x_j} \in U,{f_k} \in {\mathcal {H}}:k = \left\{ {1,2,..C} \right\} } \sum \limits _{{x_i} \in L} {\sum \limits _{k = 1}^C {\exp \left( { - \frac{{{{\left\| {{y_{ik}} - {f_k}\left( {{x_i}} \right) } \right\| }^2}}}{{2{\sigma ^2}}}} \right) } } \\ - \lambda \sum \limits _{k = 1}^C {\left\| {{f_k}} \right\| _{\mathcal {H}}^2} + \sum \limits _{k = 1}^C {\exp \left( { - \frac{{\left( {1 + 2\left| {{f_k}\left( {{x_q}} \right) } \right| + {f_k}{{\left( {{x_q}} \right) }^2}} \right) }}{{2{\sigma ^2}}}} \right) } \end{aligned} \end{aligned}$$
(8)
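The last term in (8) is a worst-case version of the MCC loss: substituting the pessimistic pseudo label \(Y_q = -sign(f_k(x_q))\) into \(\left\| Y_q - f_k(x_q)\right\| ^2\) gives \((1+|f_k(x_q)|)^2 = 1 + 2|f_k(x_q)| + f_k(x_q)^2\). A minimal sketch of this per-sample uncertainty score follows; the function and variable names are ours.

```python
import numpy as np

def mcc_margin_uncertainty(scores, sigma=1.0):
    """Worst-case MCC margin term of Eq. (8) for one candidate x_q.

    scores : array of shape (C,), the per-label predictions f_k(x_q).
    With the pessimistic pseudo label Y_q = -sign(f_k(x_q)),
    ||Y_q - f_k(x_q)||^2 = (1 + |f_k(x_q)|)^2 = 1 + 2|f_k| + f_k^2.
    """
    worst_case_sq_err = (1.0 + np.abs(scores)) ** 2
    return np.sum(np.exp(-worst_case_sq_err / (2.0 * sigma ** 2)))

# Samples near the decision boundaries (small |f_k|) receive a larger score,
# i.e. they are more uncertain and more attractive to query.
print(mcc_margin_uncertainty(np.array([0.05, -0.10, 0.02])))   # near boundary -> larger
print(mcc_margin_uncertainty(np.array([2.0, -1.8, 2.5])))      # confident     -> smaller
```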

The labeled samples in L are very limited, so it is important to utilize the unlabeled data to enhance the performance of active learning. Since the labels of the unlabeled data are unknown, it is hard to include the unlabeled data in the supervised model. To enhance the uncertainty information, we merge the representative information into the uncertainty information through the predicted labels of the unlabeled data. However, the usual feature-based similarity alone makes it difficult to exploit the unlabeled data for this purpose. To overcome this problem, and considering the influence of outlier labels, we take the predicted labels of the unlabeled data into account in the similarity measurement. We define a novel consistency between labels and sample similarity on sample-label pairs based on the MCC as

$$\begin{aligned} s\left( {\left( {{x_i},{y_i}} \right) ,\left( {{x_j},{y_j}} \right) } \right) = \exp \left( { - \frac{{\left\| {{y_i} - {y_j}} \right\| _2^2}}{{2{\sigma ^2}}}} \right) {w_{ij}} \end{aligned}$$
(9)

where \(w_{ij}\) is the similarity between two samples under a kernel function. Let \(\varvec{S} = [s_{ij}]^{u \times u}\) denote the symmetric similarity matrix of the unlabeled data, where \(s_{ij}\) is the consistency between the sample-label pairs of \(x_i\) and \(x_j\). With such a consistency matrix, representativeness amounts to finding the sample that best represents the unlabeled data set. To do so, [8] proposed a convex optimization framework by introducing variables \(p_{ij}\in [0,1]\) indicating the probability that \(x_i\) represents \(x_j\). In our MCC-based consistency measurement, if \(x_i\) can represent the point \(x_j\) but cannot represent the point \(x_t\), then \(s_{ij}\gg s_{it}\); the consistency measurement already makes the difference between representatives and non-representatives large. Therefore, we define that if \(x_i\) is the representative one, the probabilities \(p_{ij}, j=1,2,\ldots,u\) between \(x_i\) and the other unlabeled samples are 1; otherwise \(p_{ij}, j=1,2,\ldots,u\) are 0. Similarly, we define \(\varvec{d} = [d_{ij}]^{u\times l}\) and \(\varvec{z} = [z_{ij}]^{u\times l}\) as the consistency matrix and probabilities between the unlabeled data and the labeled data, respectively. To query a desirable sample that not only represents the unlabeled data but also does not overlap the information in the labeled data, we maximize the following expectation, using a tradeoff parameter \(\beta \) to balance the representative information in the unlabeled and labeled data

$$\begin{aligned} E\left[ {x{}_q,U,L} \right] = \mathop {\max }\limits _{{x_q}} \sum \limits _{{x_q} \in U} {\left[ {\left( {\frac{1}{u}\sum \limits _{{x_j} \in U} {{s_{qj}}{p_{qj}}} } \right) - \beta \left( {\frac{1}{l}\sum \limits _{{x_j} \in L} {{d_{qj}}{z_{qj}}} } \right) } \right] } \end{aligned}$$
(10)
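The sketch below evaluates the consistency of Eq. (9) and the representativeness score of Eq. (10) for one candidate. It assumes a Gaussian kernel for the sample similarity \(w_{ij}\) and the hard probabilities \(p_{qj} = z_{qj} = 1\) discussed above; for the unlabeled pool, the label arguments would be the predicted labels.

```python
import numpy as np

def consistency(xi, yi, xj, yj, sigma_y=1.0, sigma_x=1.0):
    """s((x_i, y_i), (x_j, y_j)) of Eq. (9): label-space MCC term times feature similarity w_ij."""
    label_term = np.exp(-np.sum((yi - yj) ** 2) / (2.0 * sigma_y ** 2))
    w_ij = np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma_x ** 2))
    return label_term * w_ij

def representativeness(q, X_U, Y_U, X_L, Y_L, beta=1.0):
    """Score of Eq. (10) for candidate index q, with p_qj = z_qj = 1.

    Y_U holds the current predicted labels of the unlabeled pool,
    Y_L the true labels of the labeled set.
    """
    gain = np.mean([consistency(X_U[q], Y_U[q], X_U[j], Y_U[j]) for j in range(len(X_U))])
    overlap = np.mean([consistency(X_U[q], Y_U[q], X_L[j], Y_L[j]) for j in range(len(X_L))])
    return gain - beta * overlap
```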

Recent research has shown that combining uncertainty and representativeness is very effective in active learning [8, 15]. In our approach, we also combine them with a tradeoff parameter, given by

$$\begin{aligned} \begin{aligned} \mathcal {L}(x_q,f,L)+\beta _0E[x_q,U,L] \end{aligned} \end{aligned}$$
(11)

To merge the representative part into the uncertain part, we use the predicted labels of the unlabeled data. For each classifier \(f_k\), we define \(f_k(x)\) with a linear regression model in the kernel space as \(f_k(x)=\omega _k^T\varPhi (x)\), where \(\varPhi (x)\) is the feature mapping to the kernel space. In (11), the specific point \(x_q\) is queried from the unlabeled data, but exhaustive search is not feasible due to the exponential nature of the search space. To solve this problem, we use numerical optimization techniques. We introduce an indicator vector \(\varvec{\alpha }\), a binary vector of length u, in which each entry \(\alpha _j\) denotes whether the corresponding sample \(x_j\) is chosen as the query sample: if \(x_j\) is queried as \(x_q\), \(\alpha _j\) is 1; otherwise, \(\alpha _j\) is 0. The objective function can then be defined as

$$\begin{aligned} \begin{aligned} \mathop {\arg \max }\limits _{\omega ;{\alpha ^T}1 = 1,{\alpha _i} \in \left\{ {0,1} \right\} } \sum \nolimits _{{x_i} \in L} {\sum \limits _{k = 1}^C {\exp \left( { - \frac{{{{\left\| {{y_{ik}} - \omega _k^T\varPhi \left( {{x_i}} \right) } \right\| }^2}}}{{2{\sigma ^2}}}} \right) } } - \lambda \sum \limits _{k = 1}^C {{{\left\| {{\omega _k}} \right\| }^2}}\\ + \sum \limits _{{x_j} \in U} {{\alpha _j}\sum \limits _{k = 1}^C {\exp \left( { - \frac{{\left( {1 + 2\left| {\omega _k^T\varPhi \left( {{x_j}} \right) } \right| + {{\left( {\omega _k^T\varPhi \left( {{x_j}} \right) } \right) }^2}} \right) }}{{2{\sigma ^2}}}} \right) } } \\ + {\beta _1}\sum \limits _{{x_j} \in U} {{\alpha _j}\left( {\frac{1}{u}} \right) \sum \limits _{{x_i} \in U} {\exp \left( { - \frac{{\left\| {{\varvec{\omega } ^T}[\varvec{I}\otimes \varPhi \left( {{x_j}} \right) ] - {\varvec{\omega } ^T}[\varvec{I}\otimes \varPhi \left( {{x_i}} \right) ]} \right\| _2^2}}{{2{\sigma ^2}}}} \right) {w_{ji}}} }\\ - {\beta _2}\sum \limits _{{x_j} \in U} {{\alpha _j}\left( {\frac{1}{l}} \right) \sum \limits _{{x_i} \in L} {\exp \left( { - \frac{{\left\| {{\varvec{\omega } ^T}[\varvec{I}\otimes \varPhi \left( {{x_j}} \right) ] - {y_i}} \right\| _2^2}}{{2{\sigma ^2}}}} \right) {w_{ji}}} } \end{aligned} \end{aligned}$$
(12)

where \(\varvec{\omega } = \{\omega _1,\omega _2,\ldots ,\omega _C\}\) is the multi-label classifier, \(\varvec{I}\) is the identity matrix of size \({C\times C}\), and \(\otimes \) is the Kronecker product between matrices. Although the objective function (12) is neither convex nor linear, we derive an iterative algorithm based on the half-quadratic technique [11, 13] with the alternating optimization strategy [2] to solve it efficiently. Based on the theory of convex conjugated functions [3], we can easily derive the following proposition [40].

Proposition 1

There exists a convex conjugate function \(\varphi \) such that

\(g\left( x \right) = \exp \left( { - \frac{{{x^2}}}{{2{\sigma ^2}}}} \right) = \mathop {\max }\limits _{p'} \left( {p'\frac{{{{\left\| x \right\| }^2}}}{{{\sigma ^2}}} - \varphi \left( {p'} \right) } \right) \)

where \(p'\) is the auxiliary variable, and, for a fixed x, the maximum is reached at \(p' = -g(x)\).

According to Proposition 1, the objective function (12) can be reformulated as

$$\begin{aligned} \begin{aligned} \mathop {\arg \min }\limits _{\omega ;{\alpha ^T}1 = 1,{\alpha _i} \in \left\{ {0,1} \right\} } \sum \nolimits _{{x_i} \in L} {\sum \limits _{k = 1}^C {\left[ {{m_{ik}}{{\left\| {{y_{ik}} - \omega _k^T\varPhi \left( {{x_i}} \right) } \right\| }^2}} \right] } } + \lambda \sum \limits _{k = 1}^C {{{\left\| {{\omega _k}} \right\| }^2}} \\ + \sum \limits _{{x_j} \in U} {{\alpha _j}\sum \limits _{k = 1}^C {\left[ {{n_{jk}}\left( {1 + 2\left| {\omega _k^T\varPhi \left( {{x_j}} \right) } \right| + {{\left( {\omega _k^T\varPhi \left( {{x_j}} \right) } \right) }^2}} \right) } \right] } } \\ - {\beta _1}\sum \limits _{{x_j} \in U} {{\alpha _j}\left( {\frac{1}{u}} \right) \sum \limits _{{x_i} \in U} {{h_{ji}}\left\| {{\varvec{\omega } ^T}[\varvec{I}\otimes \varPhi \left( {{x_j}} \right) ] - {\varvec{\omega } ^T}[\varvec{I}\otimes \varPhi \left( {{x_i}} \right) ]} \right\| _2^2{w_{ji}}} } \\ + {\beta _2}\sum \limits _{{x_j} \in U} {{\alpha _j}\left( {\frac{1}{l}} \right) \sum \limits _{{x_i} \in L} {{v_{ji}}\left\| {{\varvec{\omega } ^T}[\varvec{I}\otimes \varPhi \left( {{x_j}} \right) ] - {y_i}} \right\| _2^2{w_{ji}}} } \end{aligned} \end{aligned}$$
(13)

where \(m_{ik}, n_{jk}, h_{ji}\), and \(v_{ji}\) are the auxiliary variables, with

$$\begin{aligned} \begin{aligned} {m_{ik}} = \exp \left( { - \frac{{{{\left\| {{y_{ik}} - \omega _k^T\varPhi \left( {{x_i}} \right) } \right\| }^2}}}{{2{\sigma ^2}}}} \right) ,{x_i} \in L,{y_{ik}} \in {y_i}\\ {n_{jk}} = \exp \left( { - \frac{{\left( {1 + 2\left| {\omega _k^T\varPhi \left( {{x_j}} \right) } \right| + {{\left( {\omega _k^T\varPhi \left( {{x_j}} \right) } \right) }^2}} \right) }}{{2{\sigma ^2}}}} \right) ,{x_j} \in U\\ {h_{ji}} = \exp \left( { - \frac{{\left\| {{\varvec{\omega } ^T}[\varvec{I}\otimes \varPhi \left( {{x_j}} \right) ] - {\varvec{\omega } ^T}[\varvec{I}\otimes \varPhi \left( {{x_i}} \right) ]} \right\| _2^2}}{{2{\sigma ^2}}}} \right) ,{x_i},{x_j} \in U\\ {v_{ji}} = \exp \left( { - \frac{{\left\| {{\varvec{\omega } ^T}[\varvec{I}\otimes \varPhi \left( {{x_j}} \right) ] - {y_i}} \right\| _2^2}}{{2{\sigma ^2}}}} \right) ,{x_j} \in U,{y_i} \in y \end{aligned} \end{aligned}$$
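Under Proposition 1, each exponential in (12) is replaced by a quadratic term weighted by an auxiliary variable whose optimal value is simply the current value of that exponential. A hedged sketch of these updates follows, assuming a linear feature map \(\varPhi (x)=x\) for brevity; the function and variable names are ours.

```python
import numpy as np

def update_auxiliaries(W, X_L, Y_L, X_U, sigma=1.0):
    """Auxiliary variables m, n, h, v of Eq. (13), assuming Phi(x) = x.

    W : (d, C) weight matrix, one column omega_k per label.
    Returns m of shape (l, C), n of shape (u, C), h of shape (u, u), v of shape (u, l).
    """
    g = lambda sq: np.exp(-sq / (2.0 * sigma ** 2))     # Gaussian of a squared quantity

    F_L, F_U = X_L @ W, X_U @ W                         # per-label predictions f_k(x)
    m = g((Y_L - F_L) ** 2)                             # labeled-data MCC weights m_ik
    n = g(1.0 + 2.0 * np.abs(F_U) + F_U ** 2)           # worst-case margin weights n_jk
    h = g(((F_U[:, None, :] - F_U[None, :, :]) ** 2).sum(-1))   # unlabeled-unlabeled h_ji
    v = g(((F_U[:, None, :] - Y_L[None, :, :]) ** 2).sum(-1))   # unlabeled-labeled   v_ji
    return m, n, h, v
```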

The objective function (13) can be solved by the alternating optimization strategy. First, we fix \(\varvec{\alpha }\), and the objective function reduces to finding the optimal classifier \(\varvec{\omega }\), which can be solved by the alternating direction method of multipliers (ADMM) [4, 24]. Second, we fix the \(\varvec{\omega }\) obtained in the first step, and the objective function becomes

$$\begin{aligned} \mathop {\arg \max }\limits _{{\varvec{\alpha } ^T}1 = 1,{\alpha _i} \in \left\{ {0,1} \right\} } {\varvec{\alpha } ^T}a + {\beta _1}{\varvec{\alpha } ^T}b - {\beta _2}{\varvec{\alpha } ^T}c \end{aligned}$$
(14)
where
$$\begin{aligned} \begin{aligned} {a_j} = \sum \limits _{k = 1}^C {\exp \left( { - \frac{{\left( {1 + 2\left| {\omega _k^T\varPhi \left( {{x_j}} \right) } \right| + {{\left( {\omega _k^T\varPhi \left( {{x_j}} \right) } \right) }^2}} \right) }}{{2{\sigma ^2}}}} \right) }\\ {b_j} = \frac{1}{u}\sum \limits _{{x_i} \in U} {\exp \left( { - \frac{{\left\| {{\varvec{\omega } ^T}[\varvec{I}\otimes \varPhi \left( {{x_j}} \right) ] - {\varvec{\omega } ^T}[\varvec{I}\otimes \varPhi \left( {{x_i}} \right) ]} \right\| _2^2}}{{2{\sigma ^2}}}} \right) {w_{ji}}}\\ {c_j} = \frac{1}{l}\sum \limits _{{x_i} \in L} {\exp \left( { - \frac{{\left\| {\varvec{\omega } ^T[\varvec{I}\otimes \varPhi \left( {{x_j}} \right) ] - {y_i}} \right\| _2^2}}{{2{\sigma ^2}}}} \right) {w_{ji}}} \end{aligned} \end{aligned}$$

To solve (14), as in [5], we relax \(\alpha _j\) to the continuous range [0, 1]; \(\varvec{\alpha }\) can then be found with a linear program. The sample corresponding to the largest value in \(\varvec{\alpha }\) is queried as \(x_q\). The RMLAL algorithm is summarized in Algorithm 1.

Algorithm 1. The RMLAL algorithm.
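With \(\varvec{\omega }\) fixed, the relaxed problem (14) is a linear program over the simplex, so its optimum places all the mass of \(\varvec{\alpha }\) on the sample with the largest score \(a_j + \beta _1 b_j - \beta _2 c_j\). A hedged sketch of this selection step and of the outer loop of Algorithm 1 follows; the helper names in the comments are hypothetical placeholders for the ADMM update, the score computation and the human oracle.

```python
import numpy as np

def select_query(a, b, c, beta1=1.0, beta2=1.0):
    """Relaxed Eq. (14): the LP over the simplex puts all mass on the best-scoring sample."""
    scores = a + beta1 * b - beta2 * c
    return int(np.argmax(scores))          # index of x_q within the unlabeled pool U

# Schematic outer loop (fit_classifier_admm, compute_scores and oracle_label are
# hypothetical stand-ins for the steps described in the text):
#
# for t in range(num_queries):
#     W = fit_classifier_admm(X_L, Y_L, X_U, W)      # step 1: fix alpha, update omega via ADMM
#     a, b, c = compute_scores(W, X_L, Y_L, X_U)     # step 2: fix omega, compute a_j, b_j, c_j
#     q = select_query(a, b, c, beta1, beta2)        # pick x_q by the largest score
#     X_L = np.vstack([X_L, X_U[q:q+1]])             # add x_q to L with its oracle labels
#     Y_L = np.vstack([Y_L, oracle_label(X_U[q])])
#     X_U = np.delete(X_U, q, axis=0)                # remove x_q from the pool
```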

4 Experiments

4.1 Settings

In this section, we present experimental results to validate the effectiveness of the proposed method against prior methods on 9 multi-label data sets from the Mulan project (Footnote 1). The characteristics of the data sets are described in Table 1. To demonstrate the superiority of our method, the following methods are used as competitors.

  1. RANDOM is the baseline, which randomly selects instance-label pairs.

  2. AUDI [16] combines label ranking with threshold learning, and then exploits both uncertainty and diversity in the instance space as well as the label space.

  3. Adaptive [20] combines the max-margin prediction uncertainty and the label cardinality inconsistency as the criterion for active selection.

  4. QUIRE [15] provides a systematic way to measure and combine the informativeness and representativeness of an unlabeled instance by incorporating the correlation among labels.

  5. Batchrank [5] selects the best query via an NP-hard optimization problem based on mutual information.

  6. RMLAL: Robust Multi-label Active Learning, the method proposed in this paper.

LC denotes the average number of labels per instance in a data set. For each data set, we randomly divide it into two equal parts. One part is used as the testing set. From the other part, we randomly select 4% of the samples as the initial labeled set, and the remaining samples of this part are used as the unlabeled pool for active learning. Among the compared methods, AUDI and QUIRE query the relevance of one instance-label pair in each iteration. Since querying all labels of one instance is equivalent to querying C label-instance pairs, for a fair comparison we count C label-instance pairs as one query instance for AUDI and QUIRE. For Batchrank, the tradeoff parameter is set to 1 in the original paper; for a fair comparison, we choose the tradeoff parameter from the same candidate set as the proposed method. The parameters of the other methods are all set as in the original papers. For the kernel parameters, we adopt the same values for all methods.

Fig. 4. Comparison of different active learning methods on the benchmark datasets. The curves show the micro-F1 accuracy over queries, and each curve represents the average result of 5 runs.

Without loss of generality, LIBLINEAR (Footnote 2) is adopted as the classifier for all methods [9], and micro-F1 is used to evaluate the performance [5], as it is a commonly used performance measure in multi-label learning. For each data set, we repeat each method 5 times and report the average results. The querying process stops after 100 iterations, with one instance queried at each iteration.
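Micro-F1 pools true positives, false positives and false negatives over all labels before computing a single F1 value. A minimal sketch of the evaluation using scikit-learn follows; the 0/1 indicator encoding and the toy matrices are assumptions of this snippet (the paper encodes labels as \(\{-1,1\}\)).

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy ground truth and predictions for 4 instances and 3 labels (illustrative only).
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1]])

# Micro-F1 aggregates TP/FP/FN across all labels before computing F1.
print(f1_score(Y_true, Y_pred, average="micro"))
```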

4.2 Results

For each data set, we report the average results in Fig. 4. From these results, we observe that the proposed method performs best on most of the data sets, achieving the best results throughout almost the whole active learning process. In general, QUIRE and AUDI, the two methods that query label-instance pairs, show performance superior to Batchrank and Adaptive, which query all labels of an instance. This demonstrates that querying the relevant labels is more efficient than querying all labels for one instance. Our method, however, achieves the best performance while querying all labels for one instance rather than querying relevant label-instance pairs. The reason may be that although Batchrank and Adaptive query all the labels, they cannot avoid the influence of the outlier labels, which makes the queried samples undesirable; for QUIRE and AUDI, some label information is lost when only the limited relevant labels are queried, so they need many samples to achieve better performance. The results demonstrate that the proposed method can achieve discriminative labeling and avoid the influence of the outlier labels. In a nutshell, the proposed method, which merges uncertainty and representativeness with the MCC, effectively solves the problems in multi-label active learning stated above.

4.3 Evaluation Parameters

In the proposed method, the kernel parameter \(\sigma \) is very important for the MCC, and there are two tradeoff parameters on the uncertain part and the representative part, respectively. For convenience, in our experiments we define the kernel size \(\gamma = 1/(2\sigma ^2)\) and fix it as 1/C in the label space and as 1/dim in the feature space, where dim is the dimension of the feature space. To examine the influence of the kernel size on the proposed method, we evaluate the kernel size used by the MCC in the label space. We report the average results when the kernel size is set to \(\{\gamma , 2\gamma , 4\gamma \}\) on two popular benchmark data sets, emotions and scene [5], which have the same number of labels but different LC. For the tradeoff parameters, we choose them from a fixed candidate set \(\{1, 10, 100\}\) and also report the average results on the two data sets. The other settings are the same as in the previous experiments. Figure 5 shows the average results as the kernel size changes. We observe that the results are not very sensitive to the kernel size. The reason may be that changing \(\gamma \) only changes the relative MCC values of the discriminative labels and the outlier labels, while the MCC values of the discriminative labels always remain larger than those of the outlier labels. In practice, we set the kernel size to double \(\gamma \) for a better selection. Figure 6 shows the results with different pairs of tradeoff parameters. From these results, we observe that the uncertain and representative information have a large influence on the results; however, the best settings are opposite on the two data sets. The scene data obtains good results when \(\beta _1\) is small and \(\beta _2\) is large, while the emotions data obtains good results when \(\beta _1\) is large and \(\beta _2\) is small. This may be because the LC of scene is smaller than that of emotions, so the initial labeled information of scene is less; with such little supervised information, the labeled data become important for building the query model. When LC is large, the supervised information may be redundant, and the unlabeled data become important. Therefore, the tradeoff parameters can be chosen adaptively for different data sets according to their LC.

Fig. 5. Comparison of different \(\gamma \) on two data sets

Fig. 6. Comparison of different tradeoff parameter pairs \((\beta _1,\beta _2)\) on two data sets

5 Conclusion

Outlier labels are very common in multi-label scenarios and may bias the supervised information. In this paper, we propose a robust multi-label active learning method based on the MCC to solve this problem. The proposed method queries samples that not only help build training models with good generalization ability but also represent the similarity structure of multi-label data well. With the MCC, the supervised information of outlier labels is suppressed and that of discriminative labels is amplified. The method outperformed state-of-the-art methods in most of the experiments. The experimental analysis also reveals that it is beneficial to update the tradeoff parameter balancing the uncertain and representative information during the query process. We plan to develop an adaptive mechanism to tune this parameter automatically and make our algorithm more practical.