Graph Representation and Semi-clustering Approach for Label Space Reduction in Multi-label Classification of Documents

  • Conference paper
Computer and Information Sciences (ISCIS 2018)

Abstract

A growing number of large online text repositories require effective document classification techniques. In many cases, more than one class label should be assigned to a document. When the number of labels is large, it is difficult to obtain the required multi-label classification accuracy. Efficient reduction of the label space dimension may significantly improve classification performance. In this paper, we consider applying a graph-based semi-clustering algorithm in which documents are represented by vertices and edge weights are calculated according to the similarity of the associated texts. Semi-clusters are used to find patterns of labels that occur together. Such an approach enables label dimensionality to be reduced. The performance of the method is examined in experiments conducted on real medical documents. The classification results, assessed in terms of Classification Accuracy, F-Measure and Hamming Loss for the most popular multi-label classifiers (Binary Relevance, Classifier Chains and Label Powerset), showed good potential of the proposed methodology.
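The graph representation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how text similarity is measured, so cosine similarity over raw word counts is assumed here, and the `threshold` parameter, the function names and the toy documents are hypothetical. The scoring function follows the semi-cluster score commonly given for Pregel-style semi-clustering (internal edge weight minus a penalised boundary edge weight, normalised by the number of vertex pairs); whether the paper uses exactly this form is likewise an assumption.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, Set


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two bag-of-words vectors (assumed similarity measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_similarity_graph(
    docs: Dict[str, str], threshold: float = 0.1
) -> Dict[FrozenSet[str], float]:
    """Return undirected weighted edges between documents.

    Each document is a vertex; an edge is kept only when the similarity
    reaches `threshold` (a hypothetical cut-off, not taken from the paper).
    """
    vectors = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    edges: Dict[FrozenSet[str], float] = {}
    for u, v in combinations(sorted(vectors), 2):
        weight = cosine(vectors[u], vectors[v])
        if weight >= threshold:
            edges[frozenset((u, v))] = weight
    return edges


def semi_cluster_score(
    cluster: Set[str],
    edges: Dict[FrozenSet[str], float],
    boundary_factor: float = 0.5,
) -> float:
    """Score = (internal weight - boundary_factor * boundary weight) / (n * (n - 1) / 2)."""
    n = len(cluster)
    if n < 2:
        return 0.0  # convention chosen for this sketch; implementations differ
    internal = sum(w for e, w in edges.items() if len(e & cluster) == 2)
    boundary = sum(w for e, w in edges.items() if len(e & cluster) == 1)
    return (internal - boundary_factor * boundary) / (n * (n - 1) / 2)


# Toy documents, invented for illustration only.
docs = {
    "d1": "heart attack chest pain",
    "d2": "chest pain heart disease",
    "d3": "skin rash allergy",
}
edges = build_similarity_graph(docs)
good = semi_cluster_score({"d1", "d2"}, edges)  # similar documents grouped together
bad = semi_cluster_score({"d1", "d3"}, edges)   # unrelated documents grouped together
```

In this toy case the cluster of the two similar documents scores higher than the cluster that mixes unrelated ones, which is the property a semi-clustering algorithm exploits when it groups documents whose labels tend to co-occur.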

Author information

Corresponding author

Correspondence to Rafał Woźniak.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Woźniak, R., Zakrzewska, D. (2018). Graph Representation and Semi-clustering Approach for Label Space Reduction in Multi-label Classification of Documents. In: Czachórski, T., Gelenbe, E., Grochla, K., Lent, R. (eds) Computer and Information Sciences. ISCIS 2018. Communications in Computer and Information Science, vol 935. Springer, Cham. https://doi.org/10.1007/978-3-030-00840-6_14

  • DOI: https://doi.org/10.1007/978-3-030-00840-6_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00839-0

  • Online ISBN: 978-3-030-00840-6

  • eBook Packages: Computer Science, Computer Science (R0)
