Feature Exploration for Authorship Attribution of Lithuanian Parliamentary Speeches

Kapočiūtė-Dzikienė, Jurgita; Utka, Andrius; Šarkutė, Ligita

doi:10.1007/978-3-319-10816-2_12

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8655)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

This paper reports the first authorship attribution results obtained with automatic computational methods for the Lithuanian language. Using supervised machine learning techniques, we experimentally investigated the influence of different feature types (lexical, character, and syntactic), focusing on a few authors within three datasets containing transcripts of parliamentary speeches and debates. To keep interfering factors to a minimum, all datasets were composed by selecting candidates with the same political views (avoiding ideology-based classification) from overlapping parliamentary terms (avoiding a topic classification task).
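To make the lexical and character feature types concrete, the following is a minimal sketch of how word n-gram and document-level character n-gram counts could be extracted from a transcript. The function names, the regex tokenizer, the lowercasing step, and the sample sentence are illustrative assumptions, not details taken from the paper.

```python
import re
from collections import Counter


def word_ngrams(text, n):
    """Count word n-grams over lowercased tokens (a lexical feature type).

    Assumption: tokens are maximal runs of word characters; the paper's
    actual tokenization is not described in the abstract.
    """
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_ngrams(text, n):
    """Count character n-grams over the whole document, spaces included."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


if __name__ == "__main__":
    # Hypothetical one-line "speech" used only for demonstration.
    speech = "Gerbiami kolegos, gerbiami Seimo nariai"
    print(word_ngrams(speech, 1).most_common(1))
    print(char_ngrams(speech, 3).most_common(3))
```

Counts like these would typically be normalized and passed to a supervised classifier as feature vectors; which weighting scheme and classifier were used is not stated in the abstract.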

Experiments revealed that content-based features are more useful than function words or part-of-speech tags; moreover, lemma n-grams (sometimes used in concatenation with morphological information) outperform word or document-level character n-grams. Because Lithuanian is highly inflective and rich in both morphology and vocabulary, and because we were dealing with normative language, morphological tools were maximally helpful.
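One plausible reading of why lemma n-grams help in a highly inflective language is that different inflected forms of the same phrase collapse into a single feature. The sketch below illustrates this with lemma n-grams, optionally concatenated with a morphological tag. The `(word, lemma, tag)` triples are hand-written stand-ins for the output of a morphological analyzer, and the coarse `VERB`/`NOUN` tags and the `lemma_ngrams` helper are hypothetical, not the paper's tag set or implementation.

```python
from collections import Counter

# Hand-written (word, lemma, tag) triples standing in for tagger output.
# Assumption: simplified tags; a real analyzer would emit richer
# morphological information (case, number, gender, and so on).
annotated = [
    ("gerbiami", "gerbti", "VERB"),
    ("Seimo", "seimas", "NOUN"),
    ("nariai", "narys", "NOUN"),
    ("gerbiamas", "gerbti", "VERB"),
    ("Seimo", "seimas", "NOUN"),
    ("narys", "narys", "NOUN"),
]


def lemma_ngrams(tokens, n, with_tags=False):
    """Count n-grams over lemmas, or over lemma_TAG units if with_tags is set."""
    units = [f"{lemma}_{tag}" if with_tags else lemma for _, lemma, tag in tokens]
    return Counter(tuple(units[i:i + n]) for i in range(len(units) - n + 1))


if __name__ == "__main__":
    # The plural and singular phrases share one lemma trigram.
    print(lemma_ngrams(annotated, 3)[("gerbti", "seimas", "narys")])
    print(lemma_ngrams(annotated, 3, with_tags=True).most_common(1))
```

In this toy input the two surface phrases differ in word form, so a word trigram would count each once, while the lemma trigram is counted twice.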

Author information

Authors and Affiliations

Vytautas Magnus University, K. Donelaičio 58, LT-44248, Kaunas, Lithuania
Jurgita Kapočiūtė-Dzikienė & Andrius Utka
Kaunas University of Technology, K. Donelaičio 73, LT-44029, Kaunas, Lithuania
Ligita Šarkutė

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Botanická 6a, 60200, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Department of Information Technologies, Masaryk University, 602 00, Brno, Czech Republic
Aleš Horák, Ivan Kopeček & Karel Pala

About this paper

Cite this paper

Kapočiūtė-Dzikienė, J., Utka, A., Šarkutė, L. (2014). Feature Exploration for Authorship Attribution of Lithuanian Parliamentary Speeches. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2014. Lecture Notes in Computer Science, vol 8655. Springer, Cham. https://doi.org/10.1007/978-3-319-10816-2_12

DOI: https://doi.org/10.1007/978-3-319-10816-2_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10815-5
Online ISBN: 978-3-319-10816-2
eBook Packages: Computer Science, Computer Science (R0)
