Bag-of-Words, Bag-of-Topics and Word-to-Vec Based Subject Classification of Text Documents in Polish - A Comparative Study

Walkowiak, Tomasz; Datko, Szymon; Maciejewski, Henryk

doi:10.1007/978-3-319-91446-6_49

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 761)

Included in the following conference series:

International Conference on Dependability and Complex Systems

Abstract

This paper addresses the classification of Polish-language documents by subject category. We compare four state-of-the-art approaches to this task, which differ primarily in how documents are represented as feature vectors. Two of the methods use a frequency-of-words or frequency-of-topics representation of the documents and rely on Natural Language Processing (NLP) tools to pre-process the raw text. The two alternative methods do not involve NLP tools: they construct feature vectors either from vector representations of words (the Word2Vec method) or from frequencies of topics derived from the raw text. The four approaches are evaluated on three corpora with 5, 34 and 25 subject categories, respectively, and with different levels of class discrimination. The results suggest that no single method outperforms the others in all tests; however, tests with a large number of training observations seem to favour the NLP-free Word2Vec method.
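To make the difference between the representations concrete, the sketch below builds two feature vectors for one toy document: a frequency-of-words vector and an averaged word-vector representation. It is an illustration only, not the authors' pipeline. The tokens, the vocabulary, the two-dimensional embeddings and the function names `bow_vector` and `mean_word_vector` are all made up for this example, and averaging word vectors is assumed here as one common way of turning word embeddings into a document vector; the abstract does not state how the vectors are combined. The frequency-of-topics representation is omitted because it requires a trained topic model.

```python
from collections import Counter


def bow_vector(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one document."""
    counts = Counter(tokens)
    # Normalise by the number of in-vocabulary tokens; fall back to 1
    # so that a document with no known words yields an all-zero vector.
    total = sum(counts[w] for w in vocabulary) or 1
    return [counts[w] / total for w in vocabulary]


def mean_word_vector(tokens, embeddings, dim):
    """Average the vectors of known words; zero vector if none is known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Toy data (hypothetical): a tokenised document, a fixed vocabulary,
# and hand-written 2-dimensional "embeddings" standing in for Word2Vec.
doc = ["mecz", "gol", "mecz", "sejm"]
vocabulary = ["gol", "mecz", "sejm", "ustawa"]
embeddings = {"mecz": [1.0, 0.0], "gol": [0.5, 0.5], "sejm": [0.0, 1.0]}

bow = bow_vector(doc, vocabulary)               # one value per vocabulary word
avg = mean_word_vector(doc, embeddings, dim=2)  # one value per embedding dimension

print(bow)  # [0.25, 0.5, 0.25, 0.0]
print(avg)  # [0.625, 0.375]
```

The first vector grows with the vocabulary size, while the second has a fixed length set by the embedding dimension. Either one could be passed to a standard classifier.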

Notes

1. http://scikit-learn.org/0.18/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Acknowledgment

This work was sponsored by the National Science Centre, Poland (grant 2016/21/B/ST6/02159).

Author information

Authors and Affiliations

Faculty of Electronics, Wrocław University of Science and Technology, Wrocław, Poland
Tomasz Walkowiak, Szymon Datko & Henryk Maciejewski

Corresponding author

Correspondence to Tomasz Walkowiak.

Editor information

Editors and Affiliations

Department of Computer Engineering, Wrocław University of Technology, Wrocław, Poland
Wojciech Zamojski
Department of Computer Engineering, Wrocław University of Technology, Wrocław, Poland
Jacek Mazurkiewicz
Department of Computer Engineering, Wrocław University of Technology, Wrocław, Poland
Jarosław Sugier
Department of Computer Engineering, Wrocław University of Technology, Wrocław, Poland
Tomasz Walkowiak
Polish Academy of Sciences, Systems Research Institute, Warsaw, Poland
Janusz Kacprzyk

About this paper

Cite this paper

Walkowiak, T., Datko, S., Maciejewski, H. (2019). Bag-of-Words, Bag-of-Topics and Word-to-Vec Based Subject Classification of Text Documents in Polish - A Comparative Study. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds) Contemporary Complex Systems and Their Dependability. DepCoS-RELCOMEX 2018. Advances in Intelligent Systems and Computing, vol 761. Springer, Cham. https://doi.org/10.1007/978-3-319-91446-6_49

DOI: https://doi.org/10.1007/978-3-319-91446-6_49
Published: 27 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91445-9
Online ISBN: 978-3-319-91446-6
eBook Packages: Intelligent Technologies and Robotics (R0)
