Graph Representation and Semi-clustering Approach for Label Space Reduction in Multi-label Classification of Documents

Woźniak, Rafał; Zakrzewska, Danuta

doi:10.1007/978-3-030-00840-6_14

Part of the book series: Communications in Computer and Information Science (CCIS, volume 935)

Included in the following conference series:

International Symposium on Computer and Information Sciences

Abstract

A growing number of large online text repositories require effective document classification techniques. In many cases, more than one class label should be assigned to a document. When the number of labels is large, it is difficult to obtain the required multi-label classification accuracy. Efficient reduction of the label space dimension may significantly improve classification performance. In this paper, we consider applying a graph-based semi-clustering algorithm in which documents are represented by vertices and edge weights are calculated according to the similarity of the associated texts. Semi-clusters are used to find patterns of labels that occur together. Such an approach enables the label dimensionality to be reduced. The performance of the method is examined in experiments conducted on real medical documents. The classification results, assessed in terms of Classification Accuracy, F-Measure and Hamming Loss for the most popular multi-label classifiers (Binary Relevance, Classifier Chains and Label Powerset), showed good potential for the proposed methodology.
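The graph representation described in the abstract can be illustrated with a minimal sketch. The abstract does not say which similarity measure is used, so cosine similarity over raw term counts is an assumption made here for illustration only. The function names, the `threshold` parameter and the toy documents are likewise hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_document_graph(texts, threshold=0.0):
    """Return {(i, j): weight} for document pairs whose similarity exceeds threshold.

    Each document index is a vertex; each dictionary entry is a weighted edge.
    """
    # Assumed preprocessing: lowercase and split on whitespace.
    vectors = [Counter(text.lower().split()) for text in texts]
    edges = {}
    for i, j in combinations(range(len(vectors)), 2):
        weight = cosine_similarity(vectors[i], vectors[j])
        if weight > threshold:
            edges[(i, j)] = weight
    return edges


# Toy documents invented for illustration (not from the paper's corpus).
docs = [
    "heart disease risk factors",
    "risk factors for heart failure",
    "skin infection treatment",
]
graph = build_document_graph(docs)
for (i, j), weight in graph.items():
    print(f"edge {i}-{j}: weight {weight:.3f}")
```

In this toy case only the first two documents share terms, so the graph has a single edge. A semi-clustering step would then group strongly connected vertices, and the labels of documents that fall into the same semi-cluster could be inspected for co-occurrence patterns.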

Author information

Authors and Affiliations

Institute of Information Technology, Lodz University of Technology, Wólczańska 215, 90-924, Łódź, Poland
Rafał Woźniak & Danuta Zakrzewska

Authors

Rafał Woźniak
Danuta Zakrzewska

Corresponding author

Correspondence to Rafał Woźniak.

Editor information

Editors and Affiliations

Institute of Theoretical and Applied Informatics, Polish Academy of Sciences, Gliwice, Poland
Tadeusz Czachórski
Department of Electrical and Electronic Engineering, Imperial College London, London, UK
Erol Gelenbe
Institute of Theoretical and Applied Informatics, Polish Academy of Sciences, Gliwice, Poland
Krzysztof Grochla
University of Houston, Houston, TX, USA
Ricardo Lent

About this paper

Cite this paper

Woźniak, R., Zakrzewska, D. (2018). Graph Representation and Semi-clustering Approach for Label Space Reduction in Multi-label Classification of Documents. In: Czachórski, T., Gelenbe, E., Grochla, K., Lent, R. (eds) Computer and Information Sciences. ISCIS 2018. Communications in Computer and Information Science, vol 935. Springer, Cham. https://doi.org/10.1007/978-3-030-00840-6_14

DOI: https://doi.org/10.1007/978-3-030-00840-6_14
Published: 16 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00839-0
Online ISBN: 978-3-030-00840-6
eBook Packages: Computer Science, Computer Science (R0)
