Abstract
This paper addresses the classification of Polish-language documents by subject category. We compare four state-of-the-art approaches to this task, which differ primarily in how documents are represented as feature vectors. Two of the methods use a frequency-of-words or frequency-of-topics representation of the documents and rely on Natural Language Processing (NLP) tools to pre-process the raw text. The other two methods do not involve NLP tools: they construct feature vectors from vector representations of words (the Word2Vec method) or from frequencies of topics derived from the raw text. The four approaches are evaluated on three corpora with 5, 34 and 25 subject categories, respectively, and with different levels of class discrimination. The results suggest that no single method outperforms the others in all tests; however, tests with a large number of training observations seem to favour the NLP-free Word2Vec method.
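To make the contrast between document representations concrete, the following minimal Python sketch builds a frequency-of-words vector and an averaged word-vector representation for a toy document. Everything in it is an illustrative assumption: the function names, the three-word vocabulary and the two-dimensional embeddings are hypothetical, and averaging word vectors is only one common way to turn word embeddings into a document vector. The abstract does not state how the compared methods aggregate word vectors.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Relative frequency of each vocabulary word in the document."""
    counts = Counter(tokens)
    # Normalise by the number of in-vocabulary tokens; avoid division by zero.
    total = sum(counts[w] for w in vocabulary) or 1
    return [counts[w] / total for w in vocabulary]


def mean_word_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Hypothetical toy data (not taken from the paper's corpora).
vocabulary = ["mecz", "gol", "rząd"]
embeddings = {
    "mecz": [1.0, 0.0],
    "gol": [0.0, 1.0],
    "rząd": [-1.0, -1.0],
}
doc = ["mecz", "gol", "gol", "i"]

bow = bag_of_words(doc, vocabulary)                # one entry per vocabulary word
w2v = mean_word_vector(doc, embeddings, dim=2)     # one entry per embedding dimension
```

The frequency-of-words vector grows with the vocabulary and typically benefits from lemmatisation in a highly inflected language such as Polish, whereas the averaged word vector has a fixed, small dimension and can be computed directly from raw tokens.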
Acknowledgment
This work was sponsored by the National Science Centre, Poland (grant 2016/21/B/ST6/02159).
Copyright information
© 2019 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Walkowiak, T., Datko, S., Maciejewski, H. (2019). Bag-of-Words, Bag-of-Topics and Word-to-Vec Based Subject Classification of Text Documents in Polish - A Comparative Study. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds) Contemporary Complex Systems and Their Dependability. DepCoS-RELCOMEX 2018. Advances in Intelligent Systems and Computing, vol 761. Springer, Cham. https://doi.org/10.1007/978-3-319-91446-6_49
DOI: https://doi.org/10.1007/978-3-319-91446-6_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91445-9
Online ISBN: 978-3-319-91446-6
eBook Packages: Intelligent Technologies and Robotics (R0)