1 Introduction

In standard supervised learning, an object consists of a single instance, represented by a feature vector, and is associated with a single class label. This framework is known as single-instance single-label (SISL) learning. The goal of SISL learning is to train a classifier that learns from training instances how to assign a class label to any feature vector. However, in many real applications this framework is less suitable for modeling complex objects whose intrinsic representation is a collection of instances. Likewise, such complex objects may be associated with multiple class labels simultaneously. For example, a scene image may contain mountains, lakes, and trees, and we may associate it with the labels Landscape and Summer at the same time. If we extract a single instance to represent the image, useful information may be lost; instead, we can segment the image into multiple regions and extract one instance from each region of interest. Another example is text categorization, where a document may be annotated with multiple labels. To fully exploit content with multiple topics, it is more advantageous to represent each paragraph with one instance. Zhou and Zhang [22] introduced multi-instance multi-label (MIML) learning, where each object is represented by a bag of multiple instances (fixed-length feature vectors) and is associated with a set of class labels. Several MIML algorithms have been proposed and have achieved better performance in image and text classification than conventional methods adapted to the MIML setting. Other successful applications include genome protein function prediction [18], annotation of gene expression patterns [20], relationship extraction [15], video understanding [19], classification of bird species [1, 2], and predicting tags for web pages [14].

In most supervised learning settings, large amounts of training examples are necessary to obtain accurate models. Nevertheless, manually labeling data is typically expensive and time-consuming. Active learning is a partially supervised learning approach [3, 4, 10] that reduces the required amount of training data without compromising model performance. This goal is accomplished by selecting the most informative examples from the unlabeled data and querying their labels from an oracle (expert). Pool-based sampling is the most common active learning scenario, in which queries are drawn from a static or closed pool of unlabeled examples. Many active learning strategies have been proposed to estimate the informativeness of unlabeled samples [13, 17]. These query strategies are based on different measures, e.g., uncertainty, expected error reduction, and information density. A comprehensive literature survey on query strategies is provided by Settles [12].

For MIML datasets, the cost of labeling depends on the maximum number of possible labels for a bag of instances. In some applications, MIML provides a major advantage because it is easier or less costly to obtain labels at the bag level than at the instance level. Nevertheless, because of the multiplicity in the input and output spaces, the amount of training data required to improve model accuracy increases dramatically. For this reason, it is of great interest to implement active learning algorithms in a MIML framework. Currently, few studies have proposed active learning methods for MIML. Retz and Schwenker [9] use MimlSvm [23] as the base classifier, in which the MIML data is reduced to a bag-level output vector; this representation is then used to formulate an active learning strategy. Another proposed method uses MimlFast as the base classifier and actively queries the most valuable information by exploiting diversity and uncertainty in both the input and output spaces [5].

The efficiency of an active learning algorithm relies not only on the query strategy design but also on the selection of the base classifier. Two of the most commonly used classifiers are MimlBoost and MimlSvm [22, 23]. Nevertheless, MimlBoost can handle only small datasets and does not yield good performance in general [6]. MimlSvm reaches a satisfying classification accuracy for text and image data, but usually not for other types of datasets [1, 6]. A better alternative is Miml-\(k\)NN [21] (Multi-Instance Multi-Label \(k\)-Nearest Neighbor), which combines the well-known \(k\)-Nearest Neighbor technique with MIML. Given a test example, Miml-\(k\)NN considers not only its \(\kappa \) neighbors but also its \(\kappa '\) citers, i.e., examples that consider the test example within their \(\kappa '\) nearest neighbors. The identification of neighbors and citers relies on the Hausdorff distance, which estimates the distance between bags. One advantage of using Miml-\(k\)NN with pool-based sampling is that the distances between all bags (i.e., labeled and unlabeled bags) can be precomputed and stored for later use in any model learning or prediction task. Besides this, Miml-\(k\)NN classifiers have achieved superior performance to MimlSvm and MimlBoost on different types of data such as text [11, 21], image [21, 22], and bio-acoustic data [1].

In this paper, we introduce an active multi-instance multi-label learning approach within a pool-based scenario, using Miml-\(k\)NN as the base classifier. The method aims to reduce the amount of MIML training data needed to achieve the highest possible classification performance. This paper presents two major contributions to active learning and MIML learning. First, we motivate and introduce several new query strategies within the MIML framework. Second, we conduct an empirical study of our proposed active learning methods on a variety of benchmark MIML data.

The remainder of this paper is organized as follows. Section 2 describes in detail the proposed approach. Section 3 describes the experiments and presents their results, followed by conclusions in Sect. 4.

2 Method

2.1 MIML Framework

In a MIML framework, an example \(X\) consists of a bag of instances \(X = \lbrace \mathbf {x}_{j}\rbrace _{j=1}^m\), where \(m\) is the number of instances and each instance \(\mathbf {x}_{j}= \left[ x_{1},\dots ,x_{D}\right] \) is a \(D\)-dimensional feature vector. The number of instances \(m\) can vary among bags. In this framework, each bag \(X\) can be associated with one or more labels, represented by a label set \(Y =\lbrace y_{k} \rbrace \) where \(k \in \lbrace 1,\dots ,K\rbrace \). For our purposes, \(Y\) is represented by a label indicator vector \(\mathbf {I} = \left[ I_{1},\dots ,I_{K}\right] \), where the entry \(I_{k}=1\) if \(y_{k}\in Y\) and \(I_k=0\) otherwise. Given a fully labeled training set \(\mathcal {L} = \lbrace \left( X_l,Y_l \right) \rbrace _{l=1}^L\), the learning task in a MIML framework is to train a classification model, i.e., a function \(h:2^{\mathcal {X}}\rightarrow 2^{\mathcal {Y}}\) that maps a set of instances \(X\in \mathcal {X}\) to a set of labels \(Y \in \mathcal {Y}\).
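To make the setting concrete, the following is a minimal sketch of this representation in NumPy; all variable names are illustrative and not part of the formal framework.

```python
import numpy as np

# A bag is a variable-length collection of D-dimensional instances.
X1 = np.array([[0.2, 1.1], [0.5, 0.9], [1.3, 0.4]])  # m = 3 instances, D = 2
X2 = np.array([[0.1, 0.2]])                          # m = 1 instance

# Label sets as binary indicator vectors I with I_k = 1 iff y_k is in Y (here K = 4).
Y1 = np.array([1, 0, 1, 0])                          # X1 carries labels y_1 and y_3
Y2 = np.array([0, 0, 0, 1])

# A fully labeled training set L = {(X_l, Y_l)}.
labeled = [(X1, Y1), (X2, Y2)]
```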

MIML algorithms such as MimlSvm, MimlRbf and Miml-\(k\)NN reduce the MIML problem to a single-instance multi-label problem by associating each bag \(X\) with a bag-level feature vector \(\mathbf {z}\left( X\right) \in \mathbb {R}^K\) that combines information from the instances in the bag. Each algorithm uses a different approach to compute this bag-level feature vector. Nevertheless, all these methods heavily depend on some form of bag-level distance measure. The most common choice is the Hausdorff distance \(D_H\left( X,X'\right) \). Retz and Schwenker [9] examined several variations of this distance. For this paper we consider the maximum \(D_H^{max}\), median \(D_H^{med}\), and average \(D_H^{avg}\) Hausdorff distances, defined as:

$$\begin{aligned} D_H^{max}\left( X,X'\right)&=\max {\bigg \lbrace \max _{\mathbf {x}\in X}{\min _{\mathbf {x}'\in X'}{d\left( \mathbf {x},\mathbf {x}'\right) }},\max _{\mathbf {x}'\in X'}{\min _{\mathbf {x}\in X}{d\left( \mathbf {x},\mathbf {x}'\right) }} \bigg \rbrace } \end{aligned}$$
(1a)
$$\begin{aligned} D_H^{med}\left( X,X'\right)&=\frac{1}{2}{\bigg ( \mathop {\hbox {median}}\limits _{\mathbf {x}\in X}{\min _{\mathbf {x}'\in X'}{d\left( \mathbf {x},\mathbf {x}'\right) }}+\mathop {\hbox {median}}\limits _{\mathbf {x}'\in X'}{\min _{\mathbf {x}\in X}{d\left( \mathbf {x},\mathbf {x}'\right) }} \bigg )}\end{aligned}$$
(1b)
$$\begin{aligned} D_H^{avg}\left( X,X'\right)&=\frac{1}{\left| X\right| +\left| X'\right| }{\left( \sum _{\mathbf {x}\in X}{\min _{\mathbf {x}'\in X'}{d\left( \mathbf {x},\mathbf {x}'\right) }}+\sum _{\mathbf {x}'\in X'}{\min _{\mathbf {x}\in X}{d\left( \mathbf {x},\mathbf {x}'\right) }}\right) } \end{aligned}$$
(1c)

where \(d\left( \mathbf {x},\mathbf {x}'\right) =\Vert \mathbf {x}-\mathbf {x}'\Vert \) is the Euclidean distance between instances.
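Assuming bags are stored as NumPy arrays of shape \((m, D)\), the three distances can be computed as in the following sketch; the helper name is ours.

```python
import numpy as np

def directed_min_dists(X, Xp):
    """For each instance x in X, the Euclidean distance to its nearest instance in Xp."""
    d = np.linalg.norm(X[:, None, :] - Xp[None, :, :], axis=-1)  # (|X|, |Xp|) pairwise
    return d.min(axis=1)

def hausdorff_max(X, Xp):   # Eq. (1a)
    return max(directed_min_dists(X, Xp).max(), directed_min_dists(Xp, X).max())

def hausdorff_med(X, Xp):   # Eq. (1b)
    return 0.5 * (np.median(directed_min_dists(X, Xp))
                  + np.median(directed_min_dists(Xp, X)))

def hausdorff_avg(X, Xp):   # Eq. (1c)
    total = directed_min_dists(X, Xp).sum() + directed_min_dists(Xp, X).sum()
    return total / (len(X) + len(Xp))
```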

2.2 MIML-\(k\)NN

In the following, we describe the Miml-\(k\)NN algorithm [21]. Given an example bag \(X\) and a training set \(\mathcal {L}=\lbrace \left( X_l,Y_l\right) \rbrace \), we first identify, among the training bags \(\mathcal {X}_{\mathcal {L}} = \lbrace X_l\rbrace \), the \(\kappa \) nearest neighbors and the \(\kappa '\) citers of \(X\) using the Hausdorff distance \(D_{H}\left( X,X'\right) \). That is, we identify the neighbors set \(\mathcal {N}_\kappa \left( X\right) \) and the citers set \(\mathcal {C}_{\kappa '}\left( X\right) \), defined as follows

$$\begin{aligned} \mathcal {N}_\kappa \left( X\right)&= \lbrace A \mid A \text { is one of } X\text {'s } \kappa \text { nearest neighbors in } \mathcal {X}_{\mathcal {L}}\rbrace \end{aligned}$$
(2a)
$$\begin{aligned} \mathcal {C}_{\kappa '}\left( X\right)&= \lbrace B \mid X \text { is one of } B\text {'s } \kappa ' \text { nearest neighbors in } \mathcal {X}_{\mathcal {L}}\cup \lbrace X \rbrace \rbrace \end{aligned}$$
(2b)

The citer bags are the bags that consider \(X\) to be one of their \(\kappa '\) nearest neighbors. After computing \(\mathcal {N}_\kappa \left( X\right) \) and \(\mathcal {C}_{\kappa '}\left( X\right) \), we define a labeling counter vector \( \mathbf {z}\left( X\right) = \left[ z_1\left( X\right) ,\dots ,z_K\left( X\right) \right] \), where the entry \(z_k\left( X\right) \) is the number of bags in \(\mathcal {Z}\left( X\right) =\mathcal {N}_{\kappa }\left( X\right) \cup \mathcal {C}_{\kappa '}\left( X\right) \) that include label \(y_k\) in their label set. Using the binary label vector \(\mathbf {I}\left( X\right) \), \(\mathbf {z}\left( X\right) \) is defined as

$$\begin{aligned} \mathbf {z}\left( X\right) = \sum _{X'\in \mathcal {Z}\left( X\right) }{\mathbf {I}\left( X'\right) } \end{aligned}$$
(3)

Then, the information contained in \( \mathbf {z}\left( X\right) \) is used to obtain the predicted label set \(\hat{Y}\) associated with \(X\) by employing a prediction function \(\mathbf {f}\left( X\right) = \left[ f_1\left( X\right) ,\dots ,f_K\left( X\right) \right] \) such that

$$\begin{aligned} f_k\left( X\right) = \mathbf {w}_k^{\top }\cdot \mathbf {z}\left( X\right) \end{aligned}$$
(4)

where \(\mathbf {w}_k\) is the \(k\)th column of the weight matrix \(\mathbf {W}= \left[ \mathbf {w}_1,\dots ,\mathbf {w}_K\right] \). The classification rule is that the label \(\hat{y}_k\) belongs to the predicted label set \(\hat{Y}\left( X\right) =\lbrace \hat{y}_k\rbrace \) only if \(f_k\left( X\right) > 0\). Hence, for the predicted indicator vector \(\hat{\mathbf {I}}\left( X\right) =\left[ \hat{I}_{1},\dots ,\hat{I}_{K}\right] \), the entry \(\hat{I}_{k}=1\) if \(f_k\left( X\right) >0\) and \(\hat{I}_{k}=0\) otherwise. The values of \(\mathbf {W}\) are computed with a linear classification approach by minimizing the following sum-of-squares error function

$$\begin{aligned} E = \frac{1}{2}\sum _{l=1}^{L}\sum _{k=1}^K\left( \mathbf {w}_k^{\top }\cdot \mathbf {z}\left( X_l\right) - y_k\left( X_l\right) \right) ^2 \end{aligned}$$
(5)

Minimizing this error amounts to solving for the weight matrix \(\mathbf {W}\) in a least-squares problem of the form \( \left( \mathbf {Z}^\top \mathbf {Z}\right) \mathbf {W} = \mathbf {Z}^\top \mathbf {Y}\). In this case, the matrix \(\mathbf {W}\) is computed via the singular value decomposition. A compact sketch of the whole training procedure is given below.
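The following sketch covers Eqs. 2–5, assuming a precomputed bag-distance matrix \(\mathbf {D}\) over the training bags and a binary label matrix \(\mathbf {Y}\); mapping the targets to \(\lbrace -1,+1\rbrace \) is our assumption, chosen so that the rule \(f_k\left( X\right) >0\) predicts label presence.

```python
import numpy as np

def counting_vectors(D, Y, kappa, kappa_p):
    """D: (L, L) bag-distance matrix; Y: (L, K) binary label matrix."""
    D = D.copy()
    np.fill_diagonal(D, np.inf)        # a bag is not its own neighbor
    order = np.argsort(D, axis=1)      # row l: all bags sorted by distance to bag l
    L = D.shape[0]
    Z = np.zeros(Y.shape)
    for l in range(L):
        neighbors = set(order[l, :kappa].tolist())                 # N_kappa, Eq. 2a
        citers = {b for b in range(L) if l in order[b, :kappa_p]}  # C_kappa', Eq. 2b
        for b in neighbors | citers:
            Z[l] += Y[b]               # Eq. 3: sum of label indicator vectors
    return Z

def fit_weights(Z, Y):
    # Least-squares solution of (Z^T Z) W = Z^T Y via SVD-based lstsq (Eq. 5),
    # with targets mapped to {-1, +1} (our assumption).
    W, *_ = np.linalg.lstsq(Z, 2.0 * Y - 1.0, rcond=None)
    return W                           # columns w_1, ..., w_K
```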

2.3 Active Learning

In this section, we present active learning strategies for multi-instance multi-label datasets using Miml-\(k\)NN as the base classifier. Initially we have a set of labeled data \(\mathcal {L}=\lbrace \left( X_l,Y_l\right) \rbrace \) with \(L\) labeled bags and a set of unlabeled data \(\mathcal {U}=\lbrace X_u \rbrace \) with \(U\) unlabeled bags. In an active learning scenario, the amount of unlabeled data is usually much larger than the amount of labeled data, i.e., \( U\gg L\). The main task of an active learning algorithm is to select the most informative bag \(X^*\) according to some query strategy \(\phi \left( X\right) \), which is a function evaluated on each example \(X\) from the pool \(\mathcal {U}\). In this work, the selection of the bag \(X^*\) is done according to

$$\begin{aligned} X^* = \underset{X\in \mathcal {U}}{{\text {arg max}}}\,\phi \left( X\right) \end{aligned}$$
(6)
Algorithm 1. Pool-based active MIML learning with Miml-\(k\)NN (pseudocode)

Algorithm 1 describes the pool-based active learning algorithm for training a Miml-\(k\)NN model. As noted in Sect. 1, an advantage of using Miml-\(k\)NN with pool-based sampling is that the distances between all bags (i.e., labeled and unlabeled bags) can be precomputed and stored for later use in any model learning or prediction task. As in Algorithm 1, we first calculate the bag distance matrix \(\mathbf {D}\) such that \(d_{ij}= D_H\left( X_i,X_j\right) \) for all bags \(X_i,X_j\). From this matrix we can extract the distance submatrix \(\mathbf {D}_\mathcal {L}\) of the labeled bags and use it to train a Miml-\(k\)NN model (see Eq. 5). To classify a bag \(X\), we feed the trained Miml-\(k\)NN model with the submatrix \(\mathbf {D}_{\mathcal {L}\cup \lbrace X \rbrace }\) (see Eq. 2). A sketch of the pool-based loop is given below. In the following, we describe in detail the query strategies we propose, which are later compared in an empirical study.
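The following sketch mirrors the structure described above; `train`, `phi`, and `oracle` are hypothetical stand-ins for Miml-\(k\)NN training, a query strategy, and the labeling expert, and are not part of the paper's notation.

```python
import numpy as np

def active_learning_loop(D, labels, labeled_idx, pool_idx, train, phi, oracle, n_queries):
    """D: precomputed distance matrix over all bags; labels: (N, K) indicator array,
    where pool rows are filled in only once queried."""
    labeled_idx, pool_idx = list(labeled_idx), list(pool_idx)
    for _ in range(n_queries):
        # Train on the labeled submatrix D_L (see Eq. 5).
        model = train(D[np.ix_(labeled_idx, labeled_idx)], labels[labeled_idx])
        # Score every pool bag and query the arg max of phi (Eq. 6).
        scores = [phi(model, D, labeled_idx, i) for i in pool_idx]
        star = pool_idx[int(np.argmax(scores))]
        labels[star] = oracle(star)    # the oracle provides the label set Y*
        labeled_idx.append(star)
        pool_idx.remove(star)
    return labeled_idx
```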

Uncertainty Sampling (Unc). This approach is one of the most common in the SISL framework: the learner queries the example whose label it is most uncertain about. For a multi-label problem we define the uncertainty as \(\phi \left( X\right) = 1-P(\hat{Y}|X)\), where \(P(\hat{Y}|X)\) is the bag posterior probability of the predicted label set \( \hat{Y}\) given the bag \(X\). We calculate \(P(\hat{Y}|X)\) as the joint probability of the labels \(\hat{y}_k\) found in \( \hat{Y}\left( X\right) \). For this we use the single-label posterior probabilities \(P\left( \hat{y}_k|X\right) \) to estimate the uncertainty \(\phi \left( X\right) \) as

$$\begin{aligned} \phi \left( X\right) = 1 -\prod _{\hat{y}_k\in \hat{Y}}^{} P\left( \hat{y}_k|X\right) \end{aligned}$$
(7)

The Miml-\(k\)NN classifier output for the \(k\)th label is the prediction function \(f_k\left( X\right) \). This function outputs large positive or large negative values for very certain positive or negative predictions, respectively. Considering Eq. 4, this means that when \(\left| f_k\left( X\right) \right| \gg 0\) the vectors \(\mathbf {w}_k\) and \(\mathbf {z}\) are strongly aligned (or anti-aligned). For the most uncertain label predictions, \(\left| f_k\left( X\right) \right| \approx 0\), which means that \(\mathbf {w}_k\) and \(\mathbf {z}\) are nearly orthogonal. Based on this, we estimate \(P\left( \hat{y}_k|X\right) \) by normalizing \(f_k\left( X\right) \) using the Cauchy–Schwarz inequality as follows

$$\begin{aligned} P\left( \hat{y}_k|X\right) = \frac{1}{2}\left( \frac{\mathbf {w}_k^{\top }\cdot \mathbf {z}\left( X\right) }{\Vert \mathbf {w}_k^{\top }\Vert \Vert \mathbf {z}\left( X\right) \Vert } + 1\right) \end{aligned}$$
(8)
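The following is a minimal sketch of this score (Eqs. 7 and 8), assuming the weight matrix \(\mathbf {W}\) (columns \(\mathbf {w}_k\)) and the counting vector \(\mathbf {z}\left( X\right) \) are available as NumPy arrays; the small constant guarding the denominator is our addition.

```python
import numpy as np

def uncertainty(W, z):
    """W: shape (dim, K), columns w_k; z: counting vector z(X) of shape (dim,)."""
    f = W.T @ z                                      # f_k(X) = w_k^T z(X), Eq. 4
    denom = np.linalg.norm(W, axis=0) * np.linalg.norm(z)
    p = 0.5 * (f / np.maximum(denom, 1e-12) + 1.0)   # Eq. 8
    return 1.0 - np.prod(p[f > 0])                   # Eq. 7 over the predicted set
```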

Diversity (Div). This method is based on the multi-label active learning method proposed by Huang et al. [5, 6]. It considers the most informative bags to be those whose number of predicted labels is inconsistent with the average label density of the training set. Using the indicator vector \(\mathbf {\hat{I}}\left( X\right) \), \(\phi \left( X\right) \) is formulated as follows

$$\begin{aligned} \phi \left( X\right) = \left| \frac{1}{K}\sum _{k=1}^{K}\hat{I}_k\left( X\right) -\rho _{\mathcal {L}}\right| \end{aligned}$$
(9)

where

$$\begin{aligned} \rho _{\mathcal {L}}= \frac{1}{LK}\sum _ {l=1}^{L}\sum _{k=1}^{K}I_k\left( X_l\right) \end{aligned}$$
(10)
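A minimal sketch of this score (Eqs. 9 and 10), assuming binary NumPy arrays:

```python
import numpy as np

def diversity(I_hat, Y_train):
    """I_hat: predicted indicator vector, shape (K,); Y_train: labeled indicators, shape (L, K)."""
    rho_L = Y_train.mean()            # Eq. 10: average label density of the labeled set
    return abs(I_hat.mean() - rho_L)  # Eq. 9
```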

Margin (Mrg). A high positive (or low negative) value of \(f_k(X)\) means that the model is highly certain that \(X\) does (or does not) belong to the \(k\)th class. Meanwhile, low absolute values of \(f_k(X)\) indicate high uncertainty. This strategy chooses the bag whose average absolute output value is nearest to zero, i.e.

$$\begin{aligned} \phi \left( X\right) = -\frac{1}{K}\sum _{k=1}^{K} \left| f_k\left( X\right) \right| \end{aligned}$$
(11)

Range (Rng). This method is similar to the margin query strategy. In this case, a narrower range of output values \(f_k\left( X\right) \) is considered to indicate higher uncertainty. The strategy is defined as

$$\begin{aligned} \phi \left( X\right) =-\left( \max _{k} f_k\left( X\right) - \min _{k} f_k\left( X\right) \right) \end{aligned}$$
(12)
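Both scores reduce to simple operations on the raw output vector \(\mathbf {f}\left( X\right) \); a minimal sketch of the two (Eqs. 11 and 12):

```python
import numpy as np

def margin(f):
    """Eq. 11: outputs near zero imply high uncertainty."""
    return -np.mean(np.abs(f))

def value_range(f):
    """Eq. 12: a narrow output range implies high uncertainty."""
    return -(np.max(f) - np.min(f))
```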

Percentile (Prc). This approach is related to ExtMidSelect used by Retz and Schwenker [9]. It measures the distance between the upper and lower values of \(\mathbf {f}\left( X\right) = \left[ f_{1},\dots ,f_{K}\right] \), delimited by the percentile value \(F_p\left( X\right) = \hbox {percentile}{\left( \mathbf {f}\left( X\right) ,p\right) }\) at the percentage \(p= 100\left( 1-\rho _{\mathcal {L}}\right) \%\) (see Eq. 10). The strategy is defined as

$$\begin{aligned} \phi \left( X\right) =-\left| F_{\uparrow }\left( X\right) - F_{\downarrow }\left( X\right) \right| \end{aligned}$$
(13)

where \(F_{\uparrow }\left( X\right) \) and \(F_{\downarrow }\left( X\right) \) are respectively the conditional means of the upper and lower values, i.e., \(F_{\uparrow }\left( X\right) = E\left[ \mathbf {f}\left( X\right) |f_{k} \ge F_p\right] \) and \(F_{\downarrow }\left( X\right) = E\left[ \mathbf {f}\left( X\right) |f_{k} <F_p\right] \).
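A sketch of this score (Eq. 13); the fallback to \(F_p\) when one side of the split is empty is our assumption for the degenerate case.

```python
import numpy as np

def percentile_score(f, rho_L):
    """f: output vector [f_1, ..., f_K]; rho_L: label density of the labeled set (Eq. 10)."""
    Fp = np.percentile(f, 100.0 * (1.0 - rho_L))  # split point F_p(X)
    upper, lower = f[f >= Fp], f[f < Fp]
    F_up = upper.mean() if upper.size else Fp     # conditional mean of the upper values
    F_dn = lower.mean() if lower.size else Fp     # conditional mean of the lower values
    return -abs(F_up - F_dn)                      # Eq. 13
```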

Information Density (IDC & IDH). It has been suggested that uncertainty-based strategies for SISL are prone to querying outliers. To address this problem, Settles et al. [13] proposed a strategy that favors uncertain samples nearest to clusters of unlabeled samples. This strategy combines a similarity measure \(S\left( X\right) \) with an uncertainty score \(\phi _u\left( X\right) \) such that

$$\begin{aligned} \phi \left( X\right) = \phi _u\left( X\right) \cdot S\left( X\right) \end{aligned}$$
(14)

The uncertainty factor \(\phi _u\left( X\right) \) is formulated as in Eq. 7. We define two types of similarity measures. The first approach (IDC) is based on the cosine similarity

$$\begin{aligned} \cos \left( X,X'\right)&= \frac{\tilde{\mathbf {x}} \cdot \tilde{\mathbf {x}}'}{\Vert \tilde{\mathbf {x}} \Vert \Vert \tilde{\mathbf {x}}' \Vert } \end{aligned}$$
(15)

where \(\tilde{\mathbf {x}}\) is a bag-level vector given by the mean of the features over all instances \(\mathbf {x}_j \in X\), that is, \(\tilde{\mathbf {x}} ={(1/m)}{\sum ^m_{j=1}\mathbf {x}_j}\) with \(m=|X|\). The similarity measure based on cosine similarity is defined as

$$\begin{aligned} S\left( X\right) = \frac{1}{U} \sum _{X' \in \mathcal {U}} \cos \left( X,X'\right) \end{aligned}$$
(16)

The second approach (IDH) is based on the Hausdorff distance from Eq. 1. The similarity measure is defined as

$$\begin{aligned} S\left( X\right) = 1 - \frac{\exp {\left( \bar{D}_\mathcal {U}\left( X\right) \right) }}{\displaystyle \sum _{X' \in \mathcal {U}}\exp {\left( \bar{D}_\mathcal {U}\left( X'\right) \right) }} \end{aligned}$$
(17)

where \(\bar{D}_\mathcal {U}\left( X\right) \) is the mean distance between the bag \(X\) and the unlabeled bags, that is, \(\bar{D}_\mathcal {U} \left( X\right) = (1/U)\sum _{u=1}^{U} D_{H}\left( X,X_u\right) \). To make the measures comparable, we apply a softmax normalization to \(\bar{D}_\mathcal {U}\left( X\right) \). A sketch of both similarity measures is given below.
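The following covers Eqs. 14–17; the max-subtraction in the softmax is a standard numerical-stability step that does not change the result.

```python
import numpy as np

def idc_similarity(X, pool):
    """IDC, Eqs. 15-16: average cosine similarity between bag-mean vectors."""
    x = X.mean(axis=0)                            # bag-level mean vector
    sims = []
    for Xp in pool:
        xp = Xp.mean(axis=0)
        sims.append(x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp)))
    return float(np.mean(sims))

def idh_similarities(mean_dists):
    """IDH, Eq. 17: mean_dists[i] is the mean Hausdorff distance of pool bag i to the pool."""
    e = np.exp(mean_dists - mean_dists.max())     # stabilized softmax
    return 1.0 - e / e.sum()

def information_density(phi_u, S):
    return phi_u * S                              # Eq. 14
```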

Table 1. Statistics on data sets used in experiments

3 Experiments

We conduct a series of experiments to compare the performance of the query strategies presented in this work. We employ five MIML benchmark datasets: Birds [1, 2], Reuters [11], Scene [22], CK+ [7, 8] and UnitPro(G.s.) [16, 18]. A summary of the datasets is presented in Table 1. All datasets are publicly available and prepared as MIML datasets except for CK+, which we derived from the Cohn-Kanade dataset; its labels correspond to action unit categories. A bag represents an image sequence, and we extracted appearance-based (local binary patterns) and shape-based (histogram of oriented gradients) features from each image. The UnitPro(G.s.) dataset is the complete proteome of the bacterium Geobacter sulfurreducens, downloaded from the UniProt databank [16].

For each dataset, we randomly sample \(20\%\) of the bags as test data and use the rest as the unlabeled pool for active learning. Before the active learning tasks, \(5\%\) of the unlabeled pool is randomly labeled to train an initial Miml-\(k\)NN model. After each query, we train a Miml-\(k\)NN model on the extended labeled data and test its performance on the test set. Additionally, we run an experiment with random bag sampling and use it as a reference. We run each experiment until \(50\%\) of the original unlabeled pool is labeled. In the experiments, a simulated oracle provides the requested labels. We repeat each experiment \(30\) times per dataset. The performance of the Miml-\(k\)NN models using active learning is estimated with eight measures: hamming loss, ranking loss, coverage, one-error, average accuracy, average precision, average recall, and average \(f_1\)-measure (see [1, 22, 23]). These are common performance metrics for evaluation in the MIML framework. Lower values of hamming loss, ranking loss, coverage, and one-error imply better performance, and vice versa for the other four measures.

Table 2. Miml-\(k\)NN parameters
Table 3. Comparison of query strategies at 50% of data labeled. \(\uparrow \left( \downarrow \right) \) indicate that higher (lower) values imply a better performance. \(\bullet \left( \circ \right) \) indicate that the query strategy is significantly better (worse) than a random bag sampling (Rnd) based on a paired \(t\)-test at the 5% significance level \(\left( p<0.05\right) \).

For each dataset we tuned the number of neighbors \(\kappa \), the number of citers \(\kappa '\), and the type of Hausdorff distance \(D_H\) to obtain maximum model performance. We performed a cross-validation test over all combinations of \(\left( \kappa ,\kappa '\right) \in \lbrace 1,3,5,\dots ,75\rbrace ^2\) with \(D_H \in \lbrace D_H^{max},D_H^{avg},D_H^{med} \rbrace \). For each combination we ran 30 replicas with \(20\%\) and \(80\%\) of the data randomly selected as testing and training sets, respectively. Finally, we selected the parameter setting that maximizes the average \(f_1\)-measure; the results of the parameter tuning are reported in Table 2. A sketch of this tuning procedure is given below.
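In the sketch, `evaluate_f1` is a hypothetical stand-in that trains a Miml-\(k\)NN model with the given parameters and returns its average \(f_1\)-measure on the test split.

```python
import itertools
import numpy as np

def tune(n_bags, evaluate_f1, n_reps=30):
    rng = np.random.default_rng(0)
    best, best_f1 = None, -np.inf
    for kappa, kappa_p, dist in itertools.product(
            range(1, 76, 2), range(1, 76, 2), ("max", "avg", "med")):
        f1s = []
        for _ in range(n_reps):
            idx = rng.permutation(n_bags)
            n_test = n_bags // 5                  # 20% test / 80% train split
            f1s.append(evaluate_f1(idx[n_test:], idx[:n_test], kappa, kappa_p, dist))
        if np.mean(f1s) > best_f1:
            best, best_f1 = (kappa, kappa_p, dist), float(np.mean(f1s))
    return best                                   # parameters maximizing avg. f1
```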

The results of the performance experiments are shown in Table 3. A black dot (\(\bullet \)) indicates that the performance is significantly better than random bag sampling (Rnd); a white dot (\(\circ \)) indicates the opposite. Regarding the query strategies, we observe that across all datasets several strategies outperform Rnd. The information density based approaches (IDC & IDH) perform significantly worse on UnitPro(G.s.) and Scene; in contrast, they perform better on CK+ and Birds. The best performance across all datasets is achieved by the percentile strategy (Prc), followed by the margin (Mrg) and diversity (Div) strategies. Regarding the datasets, we observe in general a remarkable performance of the strategies on Reuters and UnitPro(G.s.). On the Reuters dataset, the uncertainty (Unc) and diversity (Div) strategies are significantly better for all metrics.

Figures 1 and 2 show the performance curves as the amount of labeled data increases until the stop criterion is reached (\(50\%\) labeled). We show a selection of the most representative curves based on the average \(f_1\)-measure and hamming loss metrics. We observe in Fig. 1b that the Miml-\(k\)NN model can reach its best performance with much less labeled data (\({\sim }25\%\)) using the uncertainty (Unc) or percentile (Prc) query strategies. A similar situation can be observed in Fig. 2c, where Miml-\(k\)NN reaches nearly its lowest hamming loss at approx. \(35\%\) of labeled data using the margin (Mrg) query strategy.

Fig. 1. Example of query strategies performance based on the average \(f_1\)-measure

Fig. 2. Example of query strategies performance based on the hamming loss

4 Conclusion

In this paper we proposed an active learning approach to reduce the labeling cost of MIML datasets, using Miml-\(k\)NN as the base classifier. We introduced novel query strategies and also implemented previously used query strategies for MIML learning. Finally, we conducted an experimental evaluation on various benchmark datasets. We demonstrated that these approaches can achieve significantly better results than random selection on all datasets across various evaluation criteria.