Abstract
An increasing number of large online text repositories require effective techniques of document classification. In many cases, more than one class label should be assigned to documents. When the number of labels is big, it is difficult to obtain required multi-label classification accuracy. Efficient label space dimension reduction may significantly improve classification performance. In the paper, we consider applying graph-based semi-clustering algorithm, where documents are represented by vertices with edge weights calculated according to the similarity of associated texts. Semi-clusters are used for finding patterns of labels that occur together. Such approach enables reducing label dimensionality. The performance of the method is examined by experiments conducted on real medical documents. The assessment of classification results, in terms of Classification Accuracy, F-Measure and Hamming Loss, obtained for the most popular multi-label classifiers: Binary Relevance, Classifier Chains and Label Powerset showed good potential of the proposed methodology.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Tsoumakas, G., Katakis, I., Vlahavas, I.: Effective and efficient multilabel classification in domains with large number of labels. In: Proceedings of the ECML/PKDD Workshop on Mining Multidimensional Data, Antwerp, Belgium, pp. 30–44 (2008)
Balasubramanian, K., Lebanon, G.: The landmark selection method for multiple output prediction. In: Proceedings of the 29th International Conference on Machine Learning, pp. 283–290. Omni Press, Edinburgh (2012)
Read, J., Pfahringer, B., Holmes, G.: Multi-label classification using ensembles of pruned sets. In: Proceedings of the 2008 8th IEEE International Conference on Data Mining, pp. 995–1000. IEEE Computer Society, Washington, DC (2008)
Bi, W., Kwok, J.: Efficient multi-label classification with many labels. In: Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, vol. 28, pp. 405–413 (2013)
Hsu, D., Kakade, S.M., Langford, J., Zhang, T.: Multi-label prediction via compressed sensing. In: Bengio, Y., Schuurmans, D., Lafferty, J.D., Williams, C.K.I., Culotta, A. (eds.) Advances in Neural Information Processing Systems, vol. 22, pp. 772–780. Curran Associates Inc., Vancouver (2009)
Herrera, F., Charte, F., Rivera, A.J., del Jesus, M.J.: Multilabel Classification. Problem Analysis, Metrics and Techniques. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41111-8
Zhang, M., Zhou, Z.: A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng. 26(8), 1819–1837 (2014)
Woźniak, R., Ożdżyński, P., Zakrzewska, D.: Cluster analysis of medical text documents by using semi-clustering approach based on graph representation. Inf. Syst. Manag. 7(3), 213–224 (2018)
Glinka, K., Woźniak, R., Zakrzewska, D.: Improving multi-label medical text classification by feature selection. In: Proceedings of the 2017 IEEE 26th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises, pp. 176–181. IEEE Computer Society, Poznań (2017)
Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O’Reilly Media Inc., Sebastopol (2009)
Andersen, J.S., Zukunft, O.: Semi-clustering that scales: an empirical evaluation of GraphX. In: Proceedings of the 2016 IEEE International Congress on Big Data, pp. 333–336. IEEE Computer Society, San Francisco (2016)
Malewicz, G., et al.: Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 International Conference on Management of Data, pp. 135–146. ACM, Indianapolis (2010)
Ohsumed: text categorization corpus. http://disi.unitn.it/moschitti/corpora.htm. Accessed 6 June 2018
Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 137–142. Springer, Heidelberg (1998). https://doi.org/10.1007/BFb0026683
Weka 3: data mining software in Java. https://www.cs.waikato.ac.nz/ml/weka/. Accessed 6 June 2018
Mulan: a Java library for multi-label learning. http://mulan.sourceforge.net/. Accessed 6 June 2018
Sechidis, K., Tsoumakas, G., Vlahavas, I.: On the stratification of multi-label data. In: Gunopulos, D., Hofmann, T., Malerba, D., Vazirgiannis, M. (eds.) ECML PKDD 2011. LNCS (LNAI), vol. 6913, pp. 145–158. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23808-6_10
Okapi: most advanced open-source machine learning library for Apache Giraph. http://grafos.ml/okapi.html. Accessed 6 June 2018
NetworkX: Python software for complex networks. https://networkx.github.io/. Accessed 6 June 2018
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Woźniak, R., Zakrzewska, D. (2018). Graph Representation and Semi-clustering Approach for Label Space Reduction in Multi-label Classification of Documents. In: Czachórski, T., Gelenbe, E., Grochla, K., Lent, R. (eds) Computer and Information Sciences. ISCIS 2018. Communications in Computer and Information Science, vol 935. Springer, Cham. https://doi.org/10.1007/978-3-030-00840-6_14
Download citation
DOI: https://doi.org/10.1007/978-3-030-00840-6_14
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00839-0
Online ISBN: 978-3-030-00840-6
eBook Packages: Computer ScienceComputer Science (R0)