Authorship Attribution: Comparison of Single-Layer and Double-Layer Machine Learning

Rygl, Jan; Horák, Aleš

doi:10.1007/978-3-642-32790-2_34

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7499)

Included in the following conference series:

International Conference on Text, Speech and Dialogue

Abstract

In the traditional authorship attribution task, forensic linguistic specialists analyse and compare documents to determine who their (real) author was. Today, the number of anonymous documents keeps growing because of the expansion of the Internet, so the manual part of the authorship attribution process needs to be replaced with automatic methods. Specialized algorithms (SA) such as the delta score and word-length statistics were developed to quantify the similarity between documents, but the currently prevailing techniques build upon the machine learning (ML) approach.
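As an illustration of what a specialized algorithm of this kind might compute, the following minimal sketch compares two documents by their word-length distributions. It is not the authors' implementation: the function names, the cap of 10 on word length, and the use of cosine similarity are all assumptions made for the example.

```python
import math
import re
from collections import Counter

def word_length_profile(text, max_len=10):
    """Relative frequency of word lengths 1..max_len.

    Words longer than max_len are counted in the last bin (an assumption).
    """
    words = re.findall(r"\w+", text.lower())
    counts = Counter(min(len(w), max_len) for w in words)
    total = sum(counts.values()) or 1  # avoid division by zero on empty text
    return [counts.get(n, 0) / total for n in range(1, max_len + 1)]

def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def word_length_similarity(doc_a, doc_b):
    """Hypothetical SA: similarity of two documents' word-length profiles."""
    return cosine_similarity(word_length_profile(doc_a), word_length_profile(doc_b))
```

A score near 1 means the two documents use words of similar lengths in similar proportions. On its own this is a weak signal, which is one reason such scores are usually combined with others.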

In this paper, two machine learning approaches are compared: single-layer ML, where the results of SA (similarities of documents) are used as input attributes for the machine learning, and double-layer ML, where the numerical information characterizing the author is extracted from documents and divided into several groups. For each group, a machine learning classifier is trained, and the outputs of these classifiers are used as input attributes for ML in the second step.

The generation of attributes for the machine learning in the first step of double-layer ML, which is based on SA, is described in detail here. Documents from Czech blog servers are used for the empirical evaluation of both approaches.
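The difference between the two set-ups can be sketched in a few lines. This is a toy illustration under stated assumptions, not the method evaluated in the paper: the abstract does not name the classifier, so a small hand-written logistic regression stands in for it. The attribute values, the grouping of columns, and all function names are made up. For brevity the same toy matrix feeds both set-ups, whereas in the paper the single-layer inputs are SA similarities and the double-layer inputs are grouped author characteristics.

```python
import math

def predict_proba(w, x):
    """Logistic output for attribute vector x; the last entry of w is the bias."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + w[-1]
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, epochs=500, lr=0.5):
    """Tiny batch-gradient-descent logistic regression (stand-in classifier)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err  # gradient of the bias term
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def select(row, group):
    """Pick the columns of one attribute group from a row."""
    return [row[j] for j in group]

def train_single_layer(X, y):
    """Single layer: one classifier over all attributes at once."""
    return train_logreg(X, y)

def train_double_layer(X, y, groups):
    """Double layer: one classifier per group, then one over their outputs."""
    first = [train_logreg([select(r, g) for r in X], y) for g in groups]
    meta_X = [[predict_proba(w, select(r, g)) for w, g in zip(first, groups)]
              for r in X]
    second = train_logreg(meta_X, y)
    return first, second

def predict_double_layer(model, groups, row):
    """Run the group classifiers, then feed their outputs to the second step."""
    first, second = model
    meta = [predict_proba(w, select(row, g)) for w, g in zip(first, groups)]
    return predict_proba(second, meta)

# Made-up toy data: each row describes a document pair, label 1 = same author.
X = [[0.9, 0.8, 0.7, 0.9], [0.8, 0.9, 0.8, 0.7], [0.7, 0.8, 0.9, 0.8],
     [0.2, 0.3, 0.1, 0.2], [0.3, 0.1, 0.2, 0.3], [0.1, 0.2, 0.3, 0.1]]
y = [1, 1, 1, 0, 0, 0]
GROUPS = [[0, 1], [2, 3]]  # hypothetical split of attributes into two groups

single = train_single_layer(X, y)
double = train_double_layer(X, y, GROUPS)
same_pair = [0.85, 0.8, 0.75, 0.9]
diff_pair = [0.15, 0.2, 0.25, 0.1]
print(predict_proba(single, same_pair), predict_proba(single, diff_pair))
print(predict_double_layer(double, GROUPS, same_pair),
      predict_double_layer(double, GROUPS, diff_pair))
```

One simplification to keep in mind: the second-step classifier here is trained on first-step outputs for the same rows the first step was trained on. A more careful set-up would typically use held-out predictions to avoid overly optimistic second-step inputs.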

Author information

Authors and Affiliations

Natural Language Processing Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic
Jan Rygl & Aleš Horák

Editor information

Editors and Affiliations

Faculty of Informatics, Department of Computer Graphics and Design, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Department of Information Technologies, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Aleš Horák, Ivan Kopeček & Karel Pala

About this paper

Cite this paper

Rygl, J., Horák, A. (2012). Authorship Attribution: Comparison of Single-Layer and Double-Layer Machine Learning. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2012. Lecture Notes in Computer Science, vol 7499. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32790-2_34

DOI: https://doi.org/10.1007/978-3-642-32790-2_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32789-6
Online ISBN: 978-3-642-32790-2
eBook Packages: Computer Science, Computer Science (R0)
