
Authorship Attribution: Comparison of Single-Layer and Double-Layer Machine Learning

  • Conference paper
Text, Speech and Dialogue (TSD 2012)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7499)


Abstract

In the traditional authorship attribution task, forensic linguists analyse and compare documents to determine their real author. Today, the number of anonymous documents is growing steadily with the expansion of the Internet, so the manual part of the authorship attribution process needs to be replaced with automatic methods. Specialized algorithms (SA) such as the delta score and word-length statistics were developed to quantify the similarity between documents, but the currently prevailing techniques build on the machine learning (ML) approach.
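As an illustration of what a specialized similarity algorithm computes, the following is a minimal sketch of a Burrows-style delta score. The abstract does not say which variant the paper uses, so the function names, the `n_words` parameter, and the use of the population standard deviation are assumptions made here for illustration only.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # guard against empty documents
    return [counts[word] / total for word in vocabulary]


def burrows_delta(corpus, doc_a, doc_b, n_words=3):
    """Mean absolute difference of z-scored word frequencies (lower = more similar).

    corpus: list of tokenized documents used to pick the most frequent words
    and to estimate the mean and spread of each word's relative frequency.
    """
    # Vocabulary: the n most frequent words over the whole corpus.
    all_counts = Counter(token for doc in corpus for token in doc)
    vocabulary = [word for word, _ in all_counts.most_common(n_words)]

    # Per-word mean and standard deviation across the corpus documents.
    # Assumption: population standard deviation; other variants use the sample one.
    freqs = [relative_frequencies(doc, vocabulary) for doc in corpus]
    means = [mean(column) for column in zip(*freqs)]
    stds = [pstdev(column) for column in zip(*freqs)]

    def z_scores(doc):
        rel = relative_frequencies(doc, vocabulary)
        return [(f - m) / s if s > 0 else 0.0 for f, m, s in zip(rel, means, stds)]

    z_a, z_b = z_scores(doc_a), z_scores(doc_b)
    return mean(abs(x - y) for x, y in zip(z_a, z_b))
```

A score of this kind is the sort of document similarity that the single-layer approach would pass to a classifier as one input attribute.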

In this paper, two machine learning approaches are compared. In single-layer ML, the results of SA (document similarities) are used as input attributes for machine learning. In double-layer ML, numerical information characterizing the author is extracted from the documents and divided into several groups; a machine learning classifier is trained for each group, and the outputs of these classifiers are used as input attributes for ML in the second step.
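The double-layer arrangement can be sketched roughly as follows. This is not the authors' implementation: the toy `CentroidScorer` stands in for whatever classifier the paper actually uses, and the group names and column indices are hypothetical. For brevity the second layer is fitted on first-layer outputs computed from the same training rows; a careful setup would use held-out outputs instead.

```python
from math import dist


class CentroidScorer:
    """Toy binary classifier (assumption): compares distances to two class centroids."""

    def fit(self, rows, labels):
        self.centroids = {}
        for cls in (0, 1):
            members = [row for row, label in zip(rows, labels) if label == cls]
            self.centroids[cls] = [sum(col) / len(members) for col in zip(*members)]
        return self

    def score(self, row):
        # Positive when the row lies closer to the class-1 centroid.
        return dist(row, self.centroids[0]) - dist(row, self.centroids[1])


def select(row, columns):
    """Pick the attribute columns that belong to one group."""
    return [row[i] for i in columns]


class DoubleLayer:
    def __init__(self, groups):
        # Hypothetical grouping, e.g. {"lexical": [0, 1], "syntactic": [2, 3]}
        self.groups = groups

    def _first_layer(self, row):
        # One output per attribute group; these become second-layer inputs.
        return [self.first[name].score(select(row, cols))
                for name, cols in self.groups.items()]

    def fit(self, rows, labels):
        # Step 1: train one classifier per attribute group.
        self.first = {
            name: CentroidScorer().fit([select(r, cols) for r in rows], labels)
            for name, cols in self.groups.items()
        }
        # Step 2: train the final classifier on the first-layer outputs.
        self.second = CentroidScorer().fit(
            [self._first_layer(r) for r in rows], labels
        )
        return self

    def predict(self, row):
        return int(self.second.score(self._first_layer(row)) > 0)


# Made-up attribute vectors: label 1 could mean "same author", 0 "different author".
rows = [[0.1, 0.2, 1.0, 0.9], [0.2, 0.1, 0.8, 1.1],
        [0.9, 1.0, 0.1, 0.2], [1.1, 0.8, 0.2, 0.1]]
labels = [0, 0, 1, 1]
model = DoubleLayer({"lexical": [0, 1], "syntactic": [2, 3]}).fit(rows, labels)
```

In the single-layer variant there is no per-group step: the similarity scores produced by the specialized algorithms would be fed directly to one classifier.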

The generation of attributes for machine learning in the first step of double-layer ML, which is based on SA, is described in detail. Documents from Czech blog servers are used for the empirical evaluation of both approaches.




Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Rygl, J., Horák, A. (2012). Authorship Attribution: Comparison of Single-Layer and Double-Layer Machine Learning. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2012. Lecture Notes in Computer Science, vol 7499. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32790-2_34


  • DOI: https://doi.org/10.1007/978-3-642-32790-2_34

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-32789-6

  • Online ISBN: 978-3-642-32790-2

  • eBook Packages: Computer Science, Computer Science (R0)
