1 Introduction

The term extreme classification was coined by John Langford and Manik Varma in 2013. It is an emerging research field in machine learning that addresses classification problems with a very large number of categories (also called classes or labels) [8], often more than \(10^{5}\). Specifically, extreme classification covers extreme multi-class prediction (only one label is correct) and extreme multi-label prediction (more than one label is relevant to a given item).

In this work, we focus on extreme multi-label classification, where the label set has a dimensionality on the order of hundreds of thousands or even millions, because this task has become increasingly important in real-world applications such as text tagging. The goal of extreme multi-label classification is to learn a classifier that can annotate a new instance with relevant labels drawn from the extremely large label set. Take web tagging as an example: pages in Wikipedia are tagged with several relevant labels, and an extreme multi-label classifier trained on the existing pages can automatically label new ones. Furthermore, extreme multi-label classification can effectively address machine learning problems in web-scale data mining, such as recommendation systems and queries for ad landing pages [1, 10, 11]. Due to its ability to handle web-scale data, extreme multi-label classification has attracted growing attention in recent years.

The popular approaches to extreme multi-label classification can be divided into two categories, namely 1-vs-all approaches [2, 9, 12, 13] and tree-based approaches [5, 6, 10, 11]. 1-vs-all approaches train a classifier for each label and usually take months to train on large datasets on a standard desktop [11]. This is intolerable since extreme multi-label classification is applied in real-world scenarios such as recommendation systems and ad landing page queries, which must predict the labels of items quickly and give users an immediate answer. To overcome this, DiSMEC [2] and PPDSparse [12] take advantage of distributed systems and partition the training jobs over several computing nodes. Although effective, the hardware cost is heavy. Taking the WikiLSHTC-325K dataset as an example, it has 1,778,351 training instances and 325,056 categories. On this dataset, DiSMEC needs 3 h of training on 1000 cores, while PPDSparse takes a much shorter training time (353 s on 100 cores). If we reduce the hardware cost and train on a single core, tree-based approaches train much faster: PfastreXML [5] needs only 7.42 h on a single core compared with 749 h for DiSMEC. However, tree-based approaches have not been parallelized to accelerate the training process, and neither has Parabel, the fastest 1-vs-all approach built on a tree structure and the fastest method on a single core [11]. To overcome this, we analyze the data independence between nodes and propose the PParabel method, which parallelizes and accelerates Parabel's training process.

Our contributions are as follows:

  • We analyze the hierarchy of Parabel and find that each label exists in only one node on the same level, which means nodes on the same level are data-independent. With this data independence, we can parallelize the training process at each level.

  • We parallelize the training process in two stages. In the first stage, we parallelize the training of nodes on the same level. In the second stage, we parallelize the k-means within nodes with OpenMP according to the number of labels in each node.

  • We conduct the training with thread-level parallelism and apply OpenMP to accelerate it. This enables PParabel to run on standard desktops and minimize hardware costs.

  • We shorten the training time from one day to just one hour without adding more machines.

The rest of the paper is organized as follows. Section 2 introduces the existing approaches to extreme multi-label classification. Section 3 describes the details of our proposed PParabel method. Section 4 reports our experiments and analyzes the results. At the end of the paper, we conclude our work and indicate future directions.

2 Related Work

The existing approaches to solving the extreme multi-label classification task can be divided into four categories, namely 1-vs-all approaches [2, 9, 12, 13], label-embedding approaches [3, 4, 14], tree-based approaches [5, 6, 10, 11] and deep learning based approaches [7].

1-vs-all approaches: 1-vs-all approaches train a separate classifier per label on the whole dataset. This leads to training time linear in the number of classes, so on large datasets the training cost can be heavy [11]. On the other hand, this kind of method ignores the correlation between labels, which makes each label independent and easy to parallelize. DiSMEC [2] and PPDSparse [12] took advantage of this independence between labels and scaled the training process to large distributed settings. Although this makes the methods easy to parallelize, the hardware cost is heavy.

Label-embedding approaches: Label-embedding approaches assume that the label matrix is low-rank and can therefore be projected into a low-dimensional space, reducing the effective number of labels. However, since the training points follow a power-law distribution, this assumption leads to low accuracy [2]. Moreover, embedding approaches need a long time for training and prediction even with small embedding dimensions, let alone on large datasets.

To overcome these limitations, SLEEC [3] was proposed. It learns local embeddings instead of a global embedding; specifically, it uses a kNN method to preserve nearest neighbors in the label space. However, SLEEC does not model the label structure. [15] proposed a deep embedding method for extreme multi-label classification to overcome this. The deep embedding method uses a label graph to depict the label structure, where an edge exists if two labels are active in the same sample. With the label graph established, the DeepWalk method is used to build word2vec-style representations for all nodes in the graph. The distance between features and labels can then be computed and all the training points can be clustered.

Tree-based approaches: Tree-based approaches usually come in two types: decision trees and label trees. FastXML [10] is a state-of-the-art classifier for extreme multi-label classification. It recursively partitions a parent's feature space between its children, and learns the hierarchy by optimizing the normalized Discounted Cumulative Gain (nDCG).

Another popular tree-based approach is PfastreXML [5]. The algorithm replaces the nDCG loss with its propensity-scored variant and assigns higher rewards to accurate tail-label predictions. In this way, it improves tail-label prediction, which is the most challenging aspect of extreme multi-label classification.

Unlike the above two methods, Parabel [11] learns a few balanced label hierarchies. The root node of each hierarchy contains the whole label set, and each label tree recursively partitions a node into two balanced child nodes until the number of labels in a leaf node is smaller than a threshold. In the leaf nodes, a classifier is learned for each label. Parabel can be run on a single core with the shortest training time among these methods while matching their prediction accuracy. However, although Parabel learns the balanced label trees in parallel, it does not parallelize the node partition process within a tree to save time.

Deep learning based approaches: Deep learning is a newer direction for extreme multi-label classification. Although it has achieved great success in other areas, it was not applied to extreme classification until 2017. The first attempt is XML-CNN [7], which uses a CNN model to learn rich feature representations. Unlike the traditional CNN model, XML-CNN adopts a dynamic max-pooling scheme to obtain more than one feature and can therefore capture more fine-grained information.

Nowadays, the most popular direction is the 1-vs-all approach, since, combined with a tree structure as in Parabel, it allows logarithmic-time training and prediction. But how to reduce the hardware cost and accelerate the training process remains an open problem, which motivates the solution we propose in the next section.

Fig. 1. The main structure of PParabel.

3 Methodology: Parallel Partitioned Label Trees (PParabel)

Our method is designed to accelerate the node partition process by parallelizing it. There are two main components in our method: label trees and idle threads. Label trees are used for training the model, and internal nodes of the label trees are processed in parallel on idle threads. Figure 1 shows the main structure of PParabel, and the proposed method is described in Algorithm 1. The details are as follows. Every node in the tree learned by Parabel is partitioned into two groups, and no label can appear in both groups. In other words, the splits of different nodes are independent, and we exploit this independence to parallelize Parabel. Each node is processed on a single thread.

For each label tree, we load the feature matrix \(X=x_1,x_2,\ldots ,x_n\) and the label matrix \(Y=y_1,y_2,\ldots ,y_n\). We then compute label representations in the same way as Parabel: the representation of a label is the average of the feature vectors of the instances for which that label is positive. With all labels represented, we put the label representations in the root node and start partitioning. We parallelize the partition process in two stages, which are discussed in Sects. 3.1 and 3.2. For each node partition, we apply k-means to split the node into two child nodes. We first randomly choose two label representations as centroids and then calculate the distances between the centroids and the labels. For nodes on the top levels, we parallelize this distance calculation with OpenMP. If the clustering has not converged, we compute new centroids and repeat; once it has converged, we split the labels into two child nodes according to the resulting clusters. These child nodes are then sent to idle threads for further splitting. The partition process stops only when the number of labels in every leaf node is smaller than a threshold.
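As an illustration, the following C++ sketch shows one node split under simplifying assumptions: dense label representations, plain (unbalanced) 2-means with cosine distance, and a fixed iteration budget instead of a convergence test. The type and function names (LabelRep, split_node) are hypothetical; Parabel itself operates on sparse vectors and enforces balanced clusters.

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

using LabelRep = std::vector<float>;  // averaged feature vector of one label

// Cosine distance between two label representations.
static float distance(const LabelRep& a, const LabelRep& b) {
    float dot = 0.f, na = 0.f, nb = 0.f;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
    }
    return 1.f - dot / (std::sqrt(na) * std::sqrt(nb) + 1e-8f);
}

// Split the labels of one node into two clusters with plain 2-means.
void split_node(const std::vector<LabelRep>& reps,
                const std::vector<int>& labels_in_node,
                std::vector<int>& left, std::vector<int>& right,
                int max_iters = 50) {
    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<size_t> pick(0, labels_in_node.size() - 1);
    // Two randomly chosen label representations serve as initial centroids.
    LabelRep c0 = reps[labels_in_node[pick(gen)]];
    LabelRep c1 = reps[labels_in_node[pick(gen)]];

    for (int it = 0; it < max_iters; ++it) {
        left.clear(); right.clear();
        // Assign each label to the closer centroid.
        for (int l : labels_in_node)
            (distance(reps[l], c0) <= distance(reps[l], c1) ? left : right).push_back(l);
        // Recompute each centroid as the mean of its assigned representations.
        auto recompute = [&](const std::vector<int>& members, LabelRep& c) {
            if (members.empty()) return;
            std::fill(c.begin(), c.end(), 0.f);
            for (int l : members)
                for (size_t d = 0; d < c.size(); ++d) c[d] += reps[l][d];
            for (float& v : c) v /= static_cast<float>(members.size());
        };
        recompute(left, c0);
        recompute(right, c1);
    }
}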

We thus implement a two-stage parallelization that executes different strategies for nodes on different levels. In the following, we elaborate on the details of the two stages: the first stage is applied to all nodes, while the second is applied only to nodes on the top levels.

Algorithm 1. The PParabel training procedure.

3.1 First-Stage Parallelization

In this stage, the training process of each node is carried out on a single thread, and we parallelize the training of nodes on the same level. The idea behind this is that each node is split into two child nodes with completely disjoint label sets, so splitting one child node does not affect the other; in other words, siblings have no data dependency.
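To make the idea concrete, the sketch below (reusing LabelRep and split_node from the previous sketch, plus a hypothetical Node struct) splits all nodes of one level concurrently with an OpenMP parallel for; this is safe because sibling nodes own disjoint label sets.

#include <omp.h>
#include <utility>
#include <vector>

struct Node {
    std::vector<int> labels;    // labels owned by this node
    int left = -1, right = -1;  // indices of child nodes, filled after the split
};

// Split every node of the current level in parallel and collect the next level.
void split_level(std::vector<Node>& tree,
                 const std::vector<int>& level_nodes,
                 const std::vector<LabelRep>& reps,
                 std::vector<int>& next_level) {
    std::vector<std::pair<Node, Node>> children(level_nodes.size());

    // One node per thread; dynamic scheduling keeps idle threads busy.
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < static_cast<int>(level_nodes.size()); ++i) {
        const Node& parent = tree[level_nodes[i]];
        split_node(reps, parent.labels,
                   children[i].first.labels, children[i].second.labels);
    }

    // Attach the children sequentially; this part is cheap compared to splitting.
    for (size_t i = 0; i < level_nodes.size(); ++i) {
        tree.push_back(std::move(children[i].first));
        tree[level_nodes[i]].left = static_cast<int>(tree.size()) - 1;
        tree.push_back(std::move(children[i].second));
        tree[level_nodes[i]].right = static_cast<int>(tree.size()) - 1;
        next_level.push_back(tree[level_nodes[i]].left);
        next_level.push_back(tree[level_nodes[i]].right);
    }
}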

3.2 Second-Stage Parallelization

Since each tree node is halved, the training time of child nodes will also be halved. In other words, the time child nodes take for training should be half of the time their parent takes. We make the analysis to demonstrate this and we will discuss this in Sect. 4.3. To maximize the usage of threads and speedup the training process, we parallelize the k-means, which is used to split the parent node into two parts, for the nodes on top layers with OpenMP. Here, we set the first five layers as top layer. Since calculating the distance between labels and cluster centroids is the most time consuming step in k-means, we parallelize this process with OpenMP. For the rest nodes which are on the sixth or after sixth layers, since their label sets are not large enough and there is no idle thread, they do not need to parallelize the k-means.
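A minimal sketch of this nested parallelization follows, reusing distance and LabelRep from the earlier sketch; the depth threshold of 5 reflects the description above, and the helper names are assumptions for illustration. The OpenMP if clause enables the inner parallel loop only for top-level nodes.

// Assign every label in the node to the nearer of the two centroids.
// For nodes on the top levels (depth < 5) the loop itself runs in parallel.
void assign_to_centroids(const std::vector<LabelRep>& reps,
                         const std::vector<int>& labels_in_node,
                         const LabelRep& c0, const LabelRep& c1,
                         std::vector<int>& assignment, int depth) {
    assignment.resize(labels_in_node.size());
    const bool top_level = depth < 5;  // deeper nodes are small and all threads are busy

    #pragma omp parallel for if(top_level) schedule(static)
    for (int i = 0; i < static_cast<int>(labels_in_node.size()); ++i) {
        const LabelRep& r = reps[labels_in_node[i]];
        assignment[i] = (distance(r, c0) <= distance(r, c1)) ? 0 : 1;
    }
}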

4 Experiments

4.1 Dataset Description

Table 1. Dataset Statistics

We carry out experiments on publicly available datasets from the Extreme Classification repository. The detailed information of these datasets is shown in Table 1. All these datasets are processed from their original sources such as Wikipedia and Amazon. To evaluate the effectiveness of the algorithm on datasets of different scales, we choose one small dataset (EURLex-4K with 3,993 labels) and three large-scale datasets (WikiLSHTC-325K, Wiki-500K and Amazon-670K), which contain hundreds of thousands of labels along with millions of training points.

4.2 Evaluation Metrics

We use precision at k and speedup as the metrics for comparison. Precision at k is a commonly used metric in extreme multi-label classification that measures classification accuracy, and speedup measures the effectiveness of the parallelization. For a predicted score vector \(\hat{y} \in R^L\) and the ground truth label vector \(y\in {\left\{ 0,1\right\} }^L\), the precision at k is defined as:

$$\begin{aligned} P@k := \frac{1}{k}\sum _{l \in \mathrm {rank}_k(\hat{y})} y_l \end{aligned}$$
(1)

The speedup is defined as:

$$\begin{aligned} S=\frac{T_s}{T_p} \end{aligned}$$
(2)

where \(T_s\) is the time the experiment takes when run serially and \(T_p\) is the time it takes when run in parallel. A higher value of S indicates a more effective parallelization.
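As a small illustration, the following C++ helpers compute both metrics for a single test instance, assuming dense score and ground-truth vectors; the function names are ours.

#include <algorithm>
#include <numeric>
#include <vector>

// Precision at k for one instance: fraction of the top-k scored labels
// that are relevant according to the ground truth (Eq. 1).
double precision_at_k(const std::vector<double>& scores,
                      const std::vector<int>& truth, int k) {
    std::vector<int> idx(scores.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return scores[a] > scores[b]; });
    int hits = 0;
    for (int i = 0; i < k; ++i) hits += truth[idx[i]];
    return static_cast<double>(hits) / k;
}

// Speedup as in Eq. 2: serial time divided by parallel time.
double speedup(double serial_seconds, double parallel_seconds) {
    return serial_seconds / parallel_seconds;
}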

4.3 Results

Fig. 2. Average training time of each tree level for four datasets.

Figure 2 shows the average split time at each level. As we can see, the average split time of the second-layer nodes is about half that of the root node, the average split time of the third-layer nodes is about half that of the second-layer nodes, and the same holds for the fourth and fifth layers. From the sixth layer onwards, the split time stays almost the same, because the original label set has already been divided into more than 32 parts and the k-means in these nodes converges quickly. In other words, k-means is the most time-consuming step in the top-level nodes, whereas at deeper levels the sheer number of node partitions dominates. Therefore, we parallelize the k-means with OpenMP on the top levels to accelerate training.

Fig. 3. Speedup of different threads on four datasets.

Table 2. Results on extreme classification datasets.

All experiments are run on two Intel Xeon E5-2620 v4 2.10 GHz CPUs. Each CPU has 8 physical cores, and hyper-threading is not used. Since the proposed method is based on Parabel, precision@k and speedup are compared among Parabel, PParabel and FastXML. For Parabel and PParabel, the number of balanced trees trained is three and both algorithms use the squared hinge loss, while FastXML trains fifty trees in order to achieve high accuracy. Table 2 shows the results on the extreme classification datasets. It turns out that the prediction accuracy of PParabel is almost the same as Parabel, and much better than FastXML even though the latter trains many more trees to increase accuracy. Since we only parallelize the partition process before learning the classifiers, the precision@k should theoretically be identical; however, choosing random starting points may lead to different clustering results, which may explain why the precision@k of Parabel and PParabel differ slightly.

Table 2 also shows the training time of the three algorithms on these datasets. It can be seen that as the number of threads increases, the training time of PParabel becomes shorter and shorter, but beyond 10 threads the training time does not decrease much. We can reduce the training time from 27 h to 1 h using just one machine, which is a substantial saving while requiring far fewer machines.

Figure 3 shows the speedup with different numbers of threads on the four datasets. The number of threads for PParabel varies from 2 to 15, while both Parabel and FastXML are run with a single thread. We set the maximum number of threads to 15 to make sure that every thread can work on a separate physical core, which is what makes the parallelization effective. All experiments are run under this setting.

The maximum speedup over Parabel is around 9, achieved on the EURLex-4K dataset, and the maximum speedup over FastXML is around 57, achieved on the Amazon-670K dataset. As can be seen, the speedup gain per thread decreases as the number of threads grows. The reason is that, to protect data consistency, other threads must be blocked until the current thread finishes writing; as the number of threads increases, the chance of being blocked grows and more time is spent on synchronization. Nevertheless, the speedup still grows almost linearly with the number of threads. In this light, the optimal number of threads to maximize performance is around 15.

5 Conclusion

In this paper, we have discussed the hardware cost and training time of four typical kinds of approaches to extreme multi-label classification. In order to reduce the hardware cost and speed up the training process, we have proposed the PParabel algorithm based on Parabel. Our main contribution is a two-stage thread-level parallelization, and we analyze the data independence of nodes on the same level to make sure the training process can be parallelized correctly. The experimental results show that our method successfully accelerates the training process, and all our experiments are conducted on a standard desktop. Moreover, the speedup grows almost linearly with the number of threads. In future work, we will study more efficient thread-level approaches.