Computer Based Stylometric Analysis of Texts in Polish Language

  • Conference paper
Artificial Intelligence and Soft Computing (ICAISC 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10246)

Abstract

This paper compares stylometric methods for recognizing the author, the author's gender, and the literary period of texts in Polish. Several feature selection and classification methods are analyzed. The feature sets include word frequencies (the most common words, the rarest words, and all words), frequencies of grammatical classes, and simple statistics of selected characters, words, and sentences. Because Polish is a highly inflected language, the word features are calculated as frequencies of lexemes obtained with a morpho-syntactic tagger for Polish. Nine classifiers are analyzed. The methods were tested on a set of Polish novels, with recognition performed both on whole novels and on chunked texts. The experiments showed that features based on all words give the best results. For ill-defined problems (those with low recognition accuracy), the random forest classifier gave the best results. For tasks with medium or high recognition accuracy, the multilayer perceptron and linear regression trained by stochastic gradient descent gave the best results. The paper also analyzes the statistical importance of the features used.
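To make the feature construction described in the abstract concrete, the following is a minimal sketch of chunking a text and computing relative frequencies of the most common lexemes. It assumes the texts are already lemmatized (the morpho-syntactic tagger is not reproduced here), and the function names `chunk`, `most_common_vocabulary`, and `frequency_features`, as well as the toy data, are hypothetical illustrations rather than the authors' implementation.

```python
from collections import Counter


def chunk(tokens, size):
    """Split a token list into consecutive chunks of `size` tokens.

    A trailing chunk shorter than `size` is dropped (an assumption of this sketch).
    """
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def most_common_vocabulary(documents, n):
    """Return the n most frequent lexemes counted over all documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def frequency_features(tokens, vocabulary):
    """Relative frequency of each vocabulary lexeme within one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total if total else 0.0 for w in vocabulary]


# Toy, already-lemmatized "documents" (hypothetical data, not from the paper).
docs = [
    "być i w nie być i w".split(),
    "i w na nie i w na".split(),
]
vocab = most_common_vocabulary(docs, 3)
features = [frequency_features(d, vocab) for d in docs]
```

Each row of `features` could then be passed to any of the classifiers the abstract mentions. Using the rarest words or all words instead would only change how the vocabulary is selected.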

Author information

Corresponding author

Correspondence to Tomasz Walkowiak.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Baj, M., Walkowiak, T. (2017). Computer Based Stylometric Analysis of Texts in Polish Language. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2017. Lecture Notes in Computer Science, vol 10246. Springer, Cham. https://doi.org/10.1007/978-3-319-59060-8_1

  • DOI: https://doi.org/10.1007/978-3-319-59060-8_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-59059-2

  • Online ISBN: 978-3-319-59060-8

  • eBook Packages: Computer Science, Computer Science (R0)
