Computer Based Stylometric Analysis of Texts in Polish Language

Baj, Maciej; Walkowiak, Tomasz

doi:10.1007/978-3-319-59060-8_1

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10246)

Included in the following conference series:

International Conference on Artificial Intelligence and Soft Computing

Abstract

The aim of the paper is to compare stylometric methods in the tasks of recognizing authorship, author gender and literary period for texts in Polish. Different feature selection and classification methods were analyzed. Feature sets include frequencies of common words (the most common, the rarest and all words) and of grammatical classes, as well as simple statistics of selected characters, words and sentences. Because Polish is a highly inflected language, the common-word features are calculated as the frequencies of lexemes obtained with a morpho-syntactic tagger for Polish. Nine different classifiers were analyzed. The authors tested the proposed methods on a set of Polish novels. Recognition was performed on whole novels and on chunked texts. The experiments showed that the best results are obtained for features based on all words. For ill-defined problems (with low recognition accuracy), the random forest classifier gave the best results. In other cases (tasks with medium or high recognition accuracy), the multilayer perceptron and linear regression trained by stochastic gradient descent gave the best results. Moreover, the paper includes an analysis of the statistical importance of the features used.
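
As an illustration of the word-frequency features and text chunking mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes the text has already been lemmatized by some tagger; the function names (`chunk_tokens`, `most_common_vocabulary`, `frequency_features`), the chunk size, and the toy lemma sequence are all hypothetical choices made for this example.

```python
from collections import Counter


def chunk_tokens(tokens, chunk_size):
    """Split a token list into consecutive chunks of chunk_size tokens.

    A trailing chunk shorter than chunk_size is dropped.
    """
    return [tokens[i:i + chunk_size]
            for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]


def most_common_vocabulary(documents, n_words):
    """Return the n_words most frequent tokens across all documents."""
    totals = Counter()
    for tokens in documents:
        totals.update(tokens)
    return [word for word, _ in totals.most_common(n_words)]


def frequency_features(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one document or chunk."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total if total else 0.0 for word in vocabulary]


# Toy, already-lemmatized input (hypothetical data, not taken from the paper).
lemmas = "kot być w dom i pies być w ogród i kot spać".split()
chunks = chunk_tokens(lemmas, 4)                # three chunks of four lemmas
vocabulary = most_common_vocabulary(chunks, 3)  # three most frequent lemmas
features = [frequency_features(chunk, vocabulary) for chunk in chunks]
```

Each row of `features` is one chunk described by the relative frequencies of the selected lemmas; vectors of this kind could then be passed to any of the classifier families the abstract names.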

Author information

Authors and Affiliations

Faculty of Electronics, Wroclaw University of Science and Technology, Wybrzeze Wyspianskiego 27, 50-370, Wroclaw, Poland
Maciej Baj & Tomasz Walkowiak

Corresponding author

Correspondence to Tomasz Walkowiak.

Editor information

Editors and Affiliations

Częstochowa University of Technology, Częstochowa, Poland
Leszek Rutkowski
Częstochowa University of Technology, Częstochowa, Poland
Marcin Korytkowski
Częstochowa University of Technology, Częstochowa, Poland
Rafał Scherer
AGH University of Science and Technology, Kraków, Poland
Ryszard Tadeusiewicz
University of California, Berkeley, California, USA
Lotfi A. Zadeh
University of Louisville, Louisville, Kentucky, USA
Jacek M. Zurada

About this paper

Cite this paper

Baj, M., Walkowiak, T. (2017). Computer Based Stylometric Analysis of Texts in Polish Language. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2017. Lecture Notes in Computer Science, vol 10246. Springer, Cham. https://doi.org/10.1007/978-3-319-59060-8_1

DOI: https://doi.org/10.1007/978-3-319-59060-8_1
Published: 24 May 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59059-2
Online ISBN: 978-3-319-59060-8
eBook Packages: Computer Science, Computer Science (R0)
