Graph Representation and Semi-clustering Approach for Label Space Reduction in Multi-label Classification of Documents

  • Rafał Woźniak
  • Danuta Zakrzewska
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 935)


An increasing number of large online text repositories require effective document classification techniques. In many cases, more than one class label should be assigned to a document. When the number of labels is large, it is difficult to achieve the required multi-label classification accuracy. Efficient reduction of the label space dimension may significantly improve classification performance. In this paper, we consider applying a graph-based semi-clustering algorithm in which documents are represented by vertices and edge weights are calculated according to the similarity of the associated texts. Semi-clusters are used to find patterns of labels that occur together. Such an approach enables the label dimensionality to be reduced. The performance of the method is examined in experiments conducted on real medical documents. The classification results, assessed in terms of Classification Accuracy, F-Measure and Hamming Loss for the most popular multi-label classifiers (Binary Relevance, Classifier Chains and Label Powerset), showed good potential of the proposed methodology.
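The graph representation described above can be illustrated with a minimal sketch. The abstract does not say which similarity measure or text preprocessing is used, so the cosine similarity over raw term frequencies, the whitespace tokenizer, the `min_weight` threshold, and all function names below are assumptions made purely for illustration, not the authors' implementation.

```python
import math
from collections import Counter
from itertools import combinations


def term_frequencies(text):
    """Lowercase, split on whitespace, and count terms (assumed, simplistic tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * tf_b[term] for term, count in tf_a.items() if term in tf_b)
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_similarity_graph(documents, min_weight=0.0):
    """Return {(i, j): weight} for document pairs whose similarity exceeds min_weight.

    Vertices are document indices; an edge weight is the (assumed) cosine
    similarity of the two associated texts.
    """
    vectors = [term_frequencies(doc) for doc in documents]
    edges = {}
    for i, j in combinations(range(len(vectors)), 2):
        weight = cosine_similarity(vectors[i], vectors[j])
        if weight > min_weight:
            edges[(i, j)] = weight
    return edges


# Hypothetical toy documents, not taken from the paper's data set.
docs = [
    "heart attack risk factors",
    "risk factors for heart disease",
    "skin rash treatment",
]
graph = build_similarity_graph(docs)
```

In this toy example only the first two documents share terms, so the graph contains a single weighted edge between vertices 0 and 1. A semi-clustering step would then operate on such a weighted graph to group documents whose labels tend to occur together.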


Multi-label classification, Label space reduction, Text mining



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Institute of Information Technology, Lodz University of Technology, Łódź, Poland
