Authorship Attribution Using Stylometry and Machine Learning Techniques

  • Conference paper
Intelligent Systems Technologies and Applications

Abstract

Plagiarism is considered highly unethical in the academic world. Text alignment is currently the preferred technique for estimating the degree of similarity with existing written works. Because it depends on comparison with other documents, it becomes increasingly tedious and time-consuming to scale to the growing number of online and offline documents. This paper therefore studies the use of stylometric features present in a document to verify its authorship. Two machine learning algorithms, k-NN and SMO, were used to predict the authenticity of the writings. A computer program that computes 446 features was implemented. Ten PhD theses, split into segments of 1000, 5000 and 10000 words, were used, giving a corpus of 520 documents. Our results show that stylometry-based authorship attribution achieved an accuracy above 90% in every configuration except 7-NN with 1000-word segments. We also show how authorship attribution can be used to identify potential cases of plagiarism in formal writing.
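The pipeline summarised in the abstract (segment each text into fixed-size word blocks, compute stylometric features per block, classify with k-NN) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not list the 446 features, so the handful of features below, the `FUNCTION_WORDS` list, and all function names are hypothetical, and SMO is not shown.

```python
import math
import string
from collections import Counter

# Hypothetical, tiny function-word list; the paper's 446 features
# are not enumerated in the abstract.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "is", "it")


def split_into_segments(text, segment_size):
    """Split text into consecutive segments of segment_size words.

    A trailing remainder shorter than segment_size is dropped.
    """
    words = text.split()
    return [
        " ".join(words[i:i + segment_size])
        for i in range(0, len(words) - segment_size + 1, segment_size)
    ]


def stylometric_features(segment):
    """Return a small, illustrative feature vector for one segment."""
    words = [w.strip(string.punctuation).lower() for w in segment.split()]
    words = [w for w in words if w]
    total = len(words) or 1
    counts = Counter(words)
    avg_word_len = sum(len(w) for w in words) / total
    type_token_ratio = len(counts) / total
    punct_rate = sum(ch in string.punctuation for ch in segment) / max(len(segment), 1)
    fw_rates = [counts[fw] / total for fw in FUNCTION_WORDS]
    return [avg_word_len, type_token_ratio, punct_rate] + fw_rates


def knn_predict(train_vectors, train_labels, query, k=1):
    """Majority label among the k nearest training vectors (Euclidean distance)."""
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]
```

To use it, each known author's text would be passed through `split_into_segments` and `stylometric_features` to build the training vectors, and a disputed segment would then be labelled with `knn_predict`. Because k-NN is distance-based, features on very different scales would normally be normalised first; that step is omitted here for brevity.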

Author information

Corresponding author

Correspondence to Sameerchand Pudaruth.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Ramnial, H., Panchoo, S., Pudaruth, S. (2016). Authorship Attribution Using Stylometry and Machine Learning Techniques. In: Berretti, S., Thampi, S., Srivastava, P. (eds) Intelligent Systems Technologies and Applications. Advances in Intelligent Systems and Computing, vol 384. Springer, Cham. https://doi.org/10.1007/978-3-319-23036-8_10

  • DOI: https://doi.org/10.1007/978-3-319-23036-8_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23035-1

  • Online ISBN: 978-3-319-23036-8

  • eBook Packages: Engineering, Engineering (R0)
