Bag-of-Words, Bag-of-Topics and Word-to-Vec Based Subject Classification of Text Documents in Polish - A Comparative Study

  • Conference paper
Contemporary Complex Systems and Their Dependability (DepCoS-RELCOMEX 2018)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 761)

Abstract

This paper addresses the classification of Polish-language documents by subject category. We compare four state-of-the-art approaches to this task, which differ primarily in how the documents are represented as feature vectors. Two of the methods use a frequency-of-words or frequency-of-topics representation of the documents and rely on Natural Language Processing (NLP) technology to pre-process the raw text. The other two methods do not involve NLP technology: they construct feature vectors from vector representations of words (the Word2Vec method) or from a frequency of topics derived from the raw text. The four approaches are evaluated on three corpora with 5, 34 and 25 subject categories, respectively, and with different levels of class discrimination. The results suggest that no single method outperforms the others in all tests; however, tests with a large number of training observations seem to favour the NLP-free Word2Vec method.
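To make the difference between the representations concrete, here is a minimal sketch in plain Python of two ways a tokenised document might be turned into a feature vector: relative word frequencies over a fixed vocabulary, and an average of per-word vectors. The toy vocabulary, the two-dimensional embeddings and the helper names `bag_of_words` and `mean_word_vector` are hypothetical and chosen only for illustration; averaging word vectors is one common way of building a document vector and is not necessarily the procedure used in the paper.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Return relative word frequencies over a fixed vocabulary."""
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in vocabulary]


def mean_word_vector(tokens, embeddings, dim):
    """Return the average embedding of the tokens that have a vector."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Hypothetical toy data: four Polish words and made-up 2-d embeddings.
vocabulary = ["mecz", "bramka", "rząd", "ustawa"]
embeddings = {
    "mecz": [1.0, 0.0],
    "bramka": [0.8, 0.2],
    "rząd": [0.0, 1.0],
    "ustawa": [0.2, 0.8],
}

doc = "mecz bramka mecz ustawa".split()
bow = bag_of_words(doc, vocabulary)               # one entry per vocabulary word
w2v = mean_word_vector(doc, embeddings, dim=2)    # one entry per embedding dimension
print(bow)
print(w2v)
```

In this sketch the bag-of-words vector grows with the vocabulary, whereas the averaged word vector has a fixed length set by the embedding dimension. Either vector could then be passed to a standard classifier.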

Notes

  1. http://scikit-learn.org/0.18/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Acknowledgment

This work was sponsored by the National Science Centre, Poland (grant 2016/21/B/ST6/02159).

Author information

Correspondence to Tomasz Walkowiak.

Copyright information

© 2019 Springer International Publishing AG, part of Springer Nature

Cite this paper

Walkowiak, T., Datko, S., Maciejewski, H. (2019). Bag-of-Words, Bag-of-Topics and Word-to-Vec Based Subject Classification of Text Documents in Polish - A Comparative Study. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds) Contemporary Complex Systems and Their Dependability. DepCoS-RELCOMEX 2018. Advances in Intelligent Systems and Computing, vol 761. Springer, Cham. https://doi.org/10.1007/978-3-319-91446-6_49
