1 Introduction

Over the last few years, demand for multi-label classification methods in multimedia processing has grown steadily. The demand comes from many areas in which technological progress has produced an explosion of structured data, and of multi-label data in particular. Modern applications such as semantic scene classification, music categorization and many others require methods adapted to this setting. Traces of multi-label problems can be found earlier in the machine learning literature, but the first notable multi-label problem formulation appeared in [14]. Since then, a wide range of methods and techniques for multi-label classification has been proposed. In general, multi-label classification methods can be organized into two main categories, according to the taxonomy proposed in [23]: problem transformation methods and algorithm adaptation methods. The former transform the multi-label classification problem into one or more single-label classification problems, whereas the latter provide specific learning algorithms that handle multi-label data directly.

Other taxonomies of multi-label classification methods are of course possible, for example with respect to the application area, the size of the output problem being solved (number of concurrent labels), the size of the input space (number of input attributes) or the cost function being optimized. However, the great majority of these methods are not applicable to relational domains and cannot process really large datasets.

Nowadays, relations between objects are commonly modelled by different kinds of networks. For instance, a video can be linked to several other relevant videos. In such a setting, the network model becomes a generic basis for further processing and analyses of various types. One of them is the classification of the network's nodes, in which a node has to be assigned one or more labels. This assignment may be accomplished either by inference based on the known profiles of the nodes (the regular concept of classification) or by using relational information derived from the network model. The second approach utilizes information about connections between nodes (the structure of the network) and can be very useful in assigning labels to the nodes being classified. For example, it is very likely that a given video x is related to sport (label sport) if x is directly linked by many other videos about sport.

The strongest motivation for using the relational model is its ability to reflect relationships between correlated observations (videos). For example, in a network of videos it is possible to propagate information about the known categories of a known film to other, unknown films linked from it. A new algorithm for video categorization is proposed in this paper. It takes advantage of this propagation process, following the principle of relational influence propagation [2, 16, 20]. The design of the algorithm is in line with the growing trend of data explosion in transactional systems, where enormous amounts of data require sophisticated analytical methods. There is a strong need to process big data in parallel, in clouds, especially for complex analyses such as multi-label classification.

The Iterative Multi-label Propagation (IMP) algorithm for relational learning in multi-label data, which is proposed and examined in this paper, facilitates the processing of huge datasets. Section 2 covers related work, while Section 3 explains the proposed MapReduce approach to relational large-scale multi-label classification using label propagation in the network. Section 4 describes the experimental setup and the obtained results. The paper is concluded in Section 5.

2 Related work

The most basic classification task, single-label classification, aims to assign an object (e.g. a video) to exactly one class out of two or more possible classes. For example, a video can be assigned to exactly one of three classes: it is either (i) fully, (ii) partly or (iii) not at all about sport. The more sophisticated multi-label classification assigns an object to multiple classes simultaneously. This means that a video can be classified into several categories at once, e.g. sport, news and politics, gaming, and science. Such a set of four labels is an element of the power set of the label-set, i.e. the set of all possible subsets of the label-set.

To accomplish the multi-label classification task, algorithms of two types have been introduced: problem transformation methods and algorithm adaptation methods. Representatives of the first group include, among others, Learning by Pairwise Comparison [7], Calibrated label ranking [8], Pruned sets [19] and RAkEL [22]. The second group is represented by Bayesian multi-label classification [15], the Collective Multi-Label classifier (CML) and the Collective Multi-Label with Features classifier (CMLF) [9], Ranking Support Vector Machine [6], Multi-label C4.5 decision tree [4] and Multi-label k-Nearest Neighbours [24].

The above-mentioned methods either learn independent binary classifiers denoting the relevance of each class (especially problem transformation methods) or try to capture strong co-occurrence patterns and dependencies among the classes by modelling joint modes of labels or applying distinct cost functions. However, the most common approach learns an independent binary classifier for each class and then infers the class labels independently for each test instance. Some experiments have shown that such binary relevance classifiers are able to handle multi-label data successfully [12], especially with simple label coding using Error Correcting Output Codes (ECOC).

Nevertheless, the traditional machine learning techniques mentioned above concentrate on independent and identically distributed data. This is not the case in real-world problems, where data is relational by nature and an important source of information is provided by the correlations reflected in the network structure of the objects. Recent research has focused on making use of the relational structure [17] or an extended feature space [13] in order to improve the quality of prediction. The idea of multi-label classification based on the MapReduce concept was preliminarily proposed in [10].

3 Relational large scale multi-label classification using MapReduce

The proposed Iterative Multi-label Propagation (IMP) algorithm for relational learning in multi-label data uses a Markov random walk approach to process the information of labelled and unlabelled data represented as a graph. Recently, this idea has been applied to many problems, such as classification of partially labelled text [21], binary digit recognition [25], image annotation [1] and derivation of lexical relatedness between terms [18]. In general, the approach considers the label probability distribution over the known nodes in the graph and propagates it to the unknown nodes using the connections between them.

In this paper, we adapt the general method proposed in [16] and introduce a new Iterative Multi-label Propagation algorithm. The algorithm performs multi-label inference by means of the binary relevance approach. This means that each label is modelled individually in the Markov random walk: each label from the set of possible labels (label-set) is modelled by a separate probability distribution over the known nodes. The solution is based on the physical model of harmonic energy minimization introduced in [25]. The modelled function of relational influence propagation relies on the minimization of the energy function given in (1).

Let G(V, E, W) denote a graph with vertices (nodes) V (a node is a video), arcs (edges) (i, j) ∈ E between pairs of nodes i, j, i ≠ j, and an n × n arc weight matrix W containing the weight w_ij for each edge (i, j). Then, in such a graph, the energy ε of a given potential function f is:

$$ \varepsilon(f)=\frac{1}{2} \sum\limits_{(i,j) \in E}{w_{ij}(f(i)-f(j))^2} $$
(1)

where f(·) is the potential of a node.

The energy function (1) is assumed to converge when the label probabilities are balanced across the graph. The potential f(·) may be interpreted as the label probability, which is disseminated according to the distribution of edge weights in the graph structure.
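To make (1) concrete, the following minimal sketch evaluates the energy for a small graph. The dictionary-based graph representation, the function name `energy` and the three-node example are illustrative assumptions, not part of the original method description.

```python
def energy(weights, potential):
    """Evaluate (1): 0.5 * sum over edges (i, j) of w_ij * (f(i) - f(j))**2.

    weights:   dict mapping an edge (i, j) to its weight w_ij (assumed layout)
    potential: dict mapping a node to its potential f(node)
    """
    return 0.5 * sum(
        w * (potential[i] - potential[j]) ** 2
        for (i, j), w in weights.items()
    )


# Hypothetical three-node chain a - b - c with unit weights.
weights = {("a", "b"): 1.0, ("b", "c"): 1.0}

uneven = {"a": 1.0, "b": 0.0, "c": 1.0}  # neighbours disagree
smooth = {"a": 1.0, "b": 1.0, "c": 1.0}  # neighbours agree

print(energy(weights, uneven))  # 1.0
print(energy(weights, smooth))  # 0.0
```

The energy drops to zero when linked nodes carry the same potential, which is the balanced state that the propagation aims for.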

According to [25], in a weighted graph G(V, E, W) with n = |V| vertices, label propagation may be solved by the linear equations (2) and (3).

$$ \forall{i,j}\in V \sum\limits_{(i,j)\in E}w_{ij} P_i = \sum\limits_{(i,j)\in E}w_{ij} P_j $$
(2)
$$ \forall{i}\in V \sum\limits_{c \in classes(i)} P_i = 1 $$
(3)

where P_i denotes the probability density of classes for node i.

Let us assume that the set of nodes V is partitioned into labelled vertices V_L and unlabelled vertices V_U, V = V_L ∪ V_U. Let P_u denote the probability distribution over the labels associated with vertex u ∈ V. For each node v ∈ V_L, for which P_v is known, a dummy node v′ is inserted such that w_vv′ = 1 and P_v′ = P_v. This operation is equivalent to ’clamping’ discussed in [25]. Let V_D be the set of dummy nodes. Then (2) and (3) can be solved by Iterative Multi-label Propagation, separately for each label; see Algorithm 1.

Algorithm 1 Iterative Multi-label Propagation
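The per-label propagation described above can be sketched as follows. This is not the authors' listing: it is a minimal illustration that assumes a weighted-average update derived from (2), a dictionary-based adjacency list, and a fixed number of iterations instead of a convergence test. The names `propagate_label`, `neighbours` and `known` are hypothetical.

```python
def propagate_label(neighbours, known, iterations=10):
    """Sketch of iterative propagation of a single label with dummy nodes.

    neighbours: dict node -> {neighbour: weight}, assumed symmetric
    known:      dict labelled node -> probability of the label (0.0 to 1.0)
    Returns a dict node -> estimated probability of the label.
    """
    # Copy the adjacency lists so the caller's graph is not modified.
    adj = {v: dict(nbrs) for v, nbrs in neighbours.items()}
    prob = {}
    dummies = set()
    for v, p in known.items():
        dummy = ("dummy", v)
        adj[v][dummy] = 1.0  # edge to the dummy node with weight 1
        prob[dummy] = p      # the dummy keeps the known value forever
        prob[v] = p
        dummies.add(dummy)

    for _ in range(iterations):
        updated = dict(prob)
        for v, nbrs in adj.items():
            # Only neighbours that already carry an estimate contribute.
            reached = [(u, w) for u, w in nbrs.items() if u in prob]
            total = sum(w for _, w in reached)
            if total > 0:
                updated[v] = sum(w * prob[u] for u, w in reached) / total
        prob = updated
    return {v: p for v, p in prob.items() if v not in dummies}


# Hypothetical chain a - b - c: a carries the label, c does not, b is unknown.
neighbours = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0}}
known = {"a": 1.0, "c": 0.0}
result = propagate_label(neighbours, known, iterations=100)
```

In this example the estimates settle close to 0.75 for a, 0.5 for b and 0.25 for c. The labelled nodes move away from their initial values but are pulled back by their dummy nodes, which is the clamping effect described in the text. For multi-label data the function would be called once per label, following the binary relevance approach.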

Note that the appropriate label probability for each node is calculated in the loop (line 3 of Algorithm 1). This step uses only local information, more precisely the neighbours u of the node v. Therefore, the calculation for the whole network can be performed in parallel using the MapReduce concept [5], as depicted in Algorithm 2. A single MapReduce iteration (the whole parallel Algorithm 2) replaces lines 2–6 in Algorithm 1. The general idea of MapReduce parallel computing is shown in Fig. 1.

Algorithm 2 MapReduce approach to Iterative Multi-label Propagation
Fig. 1 The MapReduce programming model. This figure is excerpted and modified from [10]

The MapReduce approach to the Iterative Multi-label Propagation algorithm consists of two consecutive phases. The Map phase takes the graph structure, i.e. all labelled and dummy nodes, and propagates their labels along the adjacency list (to the nearest neighbours) with respect to the edge weights. The Reduce phase collects the labels and the corresponding edge weights by key (here, a node) and calculates the new labels. The output of the Reduce phase, together with the original adjacency list, is the input for the Map phase of the next iteration.
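The two phases can be simulated in a single process as in the sketch below. It is an illustration for one label only and does not use any MapReduce framework; the function names `map_phase` and `reduce_phase`, the message format `(weight, probability)` and the example graph are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(adjacency, prob):
    """Every node that has a label estimate sends it to its neighbours."""
    for v, nbrs in adjacency.items():
        if v not in prob:
            continue  # nothing to propagate from this node yet
        for u, w in nbrs.items():
            yield u, (w, prob[v])  # key: receiving node


def reduce_phase(pairs, fixed):
    """Group messages by receiving node and take their weighted average."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    new_prob = dict(fixed)  # dummy nodes keep their clamped values
    for node, messages in grouped.items():
        if node in fixed:
            continue
        total = sum(w for w, _ in messages)  # assumes positive weights
        new_prob[node] = sum(w * p for w, p in messages) / total
    return new_prob


# Hypothetical chain a' - a - b - c, where a' is the dummy node of a.
adjacency = {
    "a'": {"a": 1.0},
    "a": {"a'": 1.0, "b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0},
}
fixed = {"a'": 1.0}
prob = {"a'": 1.0, "a": 1.0}

after_one = reduce_phase(map_phase(adjacency, prob), fixed)
after_two = reduce_phase(map_phase(adjacency, after_one), fixed)
```

After the first iteration only b, the direct neighbour of the known node, receives an estimate; c is reached in the second iteration. In a real deployment the grouping by key would be done by the framework's shuffle step rather than by the `defaultdict` used here.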

4 Experimental results

4.1 Dataset

To evaluate and demonstrate the proposed Iterative Multi-label Propagation algorithm, the YouTube dataset [3] was used. The dataset was crawled using the YouTube API in 2008 and was partitioned into 58 chunks. Only those attributes of the original data that were required to create the graph structure and the multi-label categorization were used: video_id, age, category, related_IDs. The weighted graph structure was created from related_IDs. The weights were distributed equally among all adjacent videos, i.e. if a video had 20 related videos, each of them was linked by an edge with a weight of 0.05. The set was partitioned into a training set and a test set using the age of each video: all objects older than 950 days were assigned to the training set, the rest to the test set. The basic features of the dataset are presented in Table 1.
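The graph construction and the age-based split described above can be sketched as follows. The input layout (dictionaries keyed by video_id) and the helper names `build_edges` and `split_by_age` are assumptions for illustration; the original preprocessing code is not given in the paper.

```python
def build_edges(related):
    """related: dict video_id -> list of related video ids (related_IDs).

    Returns a dict (video_id, related_id) -> weight, where each video's
    total weight of 1 is shared equally among its related videos.
    """
    edges = {}
    for video, others in related.items():
        if not others:
            continue  # a video without related videos contributes no edges
        share = 1.0 / len(others)
        for other in others:
            edges[(video, other)] = share
    return edges


def split_by_age(ages, threshold_days=950):
    """Videos older than the threshold form the training set, the rest the test set."""
    train = {v for v, age in ages.items() if age > threshold_days}
    test = set(ages) - train
    return train, test


# Hypothetical video with 20 related videos: every edge gets weight 0.05.
related = {"v0": ["r%d" % i for i in range(20)]}
edges = build_edges(related)

# Hypothetical ages in days.
ages = {"old": 1000, "new": 100, "borderline": 950}
train, test = split_by_age(ages)
```

Here the "borderline" video lands in the test set because the text assigns only videos older than 950 days to the training set.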

Table 1 Description of the basic features of the Youtube data set

AvgCard, reported in Table 1, measures the average number of labels associated with the nodes (videos) in a given set; see (4).

$$ AvgCard(D)=\frac{1}{|D|}\sum\limits_{i=1}^{|D|}|Y_i| $$
(4)

where D denotes the video dataset and Y_i the label-set associated with the i-th node. The density measure is calculated according to (5):

$$ density(D)=\frac{1}{|D|}\sum\limits_{i=1}^{|D|}\frac{|Y_i|}{|label{-}set|} $$
(5)

As can be seen from (5), the density measure returns the average fraction of labels from the label-set that are used to describe each video.
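Both measures are simple to compute, as the following sketch of (4) and (5) shows. The list-of-sets input and the four example videos are illustrative assumptions and are not taken from the dataset.

```python
def avg_card(label_sets):
    """Equation (4): average number of labels per video."""
    return sum(len(y) for y in label_sets) / len(label_sets)


def density(label_sets, label_set_size):
    """Equation (5): average fraction of the label-set used per video."""
    return sum(len(y) / label_set_size for y in label_sets) / len(label_sets)


# Hypothetical dataset of four videos and a label-set of five categories.
label_sets = [
    {"Music"},
    {"Sports", "News & Politics"},
    {"Gaming"},
    {"Music"},
]
card = avg_card(label_sets)
dens = density(label_sets, label_set_size=5)
```

For this example AvgCard is 1.25 and density is approximately 0.25. Since the label-set size is constant, density equals AvgCard divided by the size of the label-set.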

4.2 Results and discussion

With labels assigned to 66% of nodes and the graph of relations between nodes extracted, we can apply the parallel MapReduce Algorithm 2 to the remaining 34% of unknown nodes (videos). The processing was performed on a 30-node cloud at the WrUT Supercomputing Center, The PLATON Science Services Platform. One iteration in this environment took approximately 19.6 min.

As Fig. 2 shows, the Hamming Loss measure reaches its smallest (best) value after the fifth iteration. At the same time, the classification accuracy (Fig. 3) reaches its highest value after the third iteration. The drop in classification quality after the first few iterations follows from the general idea of Algorithm 2, which propagates the knowledge stored in the known nodes (learning set) to the unknown nodes. After the third iteration, 98.7% of nodes (the continuous line in Fig. 3) are already classified, and reaching the remaining nodes takes many further iterations. This is caused by the structure of the data: some nodes (videos) are linked very sparsely with the rest of the graph, so it takes many iterations to reach them starting from the known nodes. Note that in one iteration of Algorithm 2 only the nearest neighbours of the known (already categorized) videos can be classified.

Fig. 2 Hamming Loss measure in consecutive iterations of the algorithm

Fig. 3 Classification Accuracy and percentage of nodes reached in consecutive iterations of the algorithm
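The paper does not spell out how the Hamming Loss reported in Fig. 2 is computed. The sketch below assumes the standard definition, namely the average fraction of labels that are wrongly predicted (missing or superfluous) per video. The function name and the example label-sets are hypothetical.

```python
def hamming_loss(true_sets, predicted_sets, label_set_size):
    """Assumed standard definition: mean size of the symmetric difference
    between true and predicted label-sets, divided by the label-set size."""
    wrong = sum(len(t ^ p) for t, p in zip(true_sets, predicted_sets))
    return wrong / (len(true_sets) * label_set_size)


# Hypothetical example: two videos, a label-set of four categories.
true_sets = [{"Music"}, {"Sports"}]
predicted_sets = [{"Music"}, {"Music", "Sports"}]

loss = hamming_loss(true_sets, predicted_sets, label_set_size=4)
print(loss)  # 0.125
```

Lower values are better. With few labels per video and a comparatively large label-set, the measure stays small even when some labels are wrong, so it is best read together with accuracy.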

At the same time, after the third iteration the algorithm changes more and more assignments of the known nodes, which decreases the total accuracy. The reason is that the nodes classified at a given iteration become known nodes for the following iterations. After the first iteration, the share of categorized videos increased from 66% (Table 1) to 88.7% (Fig. 3). Note that after only three iterations the algorithm achieves most of its gains: Hamming Loss = 0.04673 (Fig. 2), Accuracy = 47.5% (Fig. 3), and percentage of classified videos = 98.7% (Fig. 3). An overall multi-label categorization accuracy of almost 50% is not a bad result; an accuracy of around 30% is quite often regarded as good.

Analysing the results for individual labels separately (Fig. 4) provides interesting observations. For some labels (such as Music, Autos & Vehicles, Sports, Pets & Animals) Algorithm 2 provides very good results, with an F-measure over 0.65. This means that the relations between videos reflected by the related_IDs attribute and utilized by the algorithm are matched well by the energy model from (1), (2) and (3). Additionally, categories of this kind are easier for the humans creating the related_IDs attributes to recognize precisely.

Fig. 4 F-measure results for distinct labels in consecutive iterations of the algorithm

On the other hand, some labels, such as Nonprofits & Activism, Gaming, and Science & Technology, tend to occur in isolation: videos with these labels quite often have no neighbours with the same label, so if they are unknown they cannot inherit the proper labels from their neighbours. This effect is strengthened by the relatively small number of labels (categories) assigned to a single video, only 1.1 on average (see the AvgCard column in Table 1).

A classification accuracy of nearly 50% in an environment where relations (the related_IDs attribute) do not necessarily link videos with the same label should be treated as a very good result.

5 Conclusions

This paper proposes a new method for multi-label categorization of videos in large-scale datasets, implemented by means of the MapReduce paradigm. Parallel computing makes it possible to process large-scale datasets efficiently. The idea of the multi-label categorization is the iterative propagation of known label-sets over the relations linking videos. No information other than the relations and the known multi-labels (multiple categories) of some videos is needed to categorize the remaining films. Other video profiles (attributes) were not used for classification.

The experiments carried out on over 5 million videos crawled from the YouTube service have revealed that MapReduce parallel processing may be very efficient. Moreover, only a few iterations (about 3) are needed to reach the best accuracy, at the level of almost 50%. Categorization is more accurate for some labels, while for others good results are hard to achieve. This stems mostly from the nature of the relations between videos in the data.

The varying classification accuracy obtained for individual labels could be improved by a modified video crawling process that uses the distribution of labels, according to the concept presented in [11].