Deep Stylometry and Lexical & Syntactic Features Based Author Attribution on PLoS Digital Repository

  • Conference paper
Digital Libraries: Data, Information, and Knowledge for Digital Lives (ICADL 2017)

Abstract

In this paper, we address the problem of author attribution through unsupervised clustering on lexical and syntactic features and through a novel deep-learning-based stylometric model. For this purpose, we downloaded all 158,918 publications available up to 1 July 2015 from PLOS.org, an open-access digital repository of full-text publications. After pre-processing, we use 803 of these: single-authored publications written by 203 unique authors. For unsupervised modeling, stylometric markers such as lexical and syntactic features form the distance matrix for the k-means clustering algorithm. For supervised modeling, we present a novel long short-term memory (LSTM) based deep learning model that attributes a given publication to its author and is evaluated on held-out testing samples. Our unsupervised model places 88.17% of authors in the correct cluster (one containing all papers written by the same author), with an entropy error coefficient of at most 0.2. Our deep-learning-based model consistently achieves above 95% accuracy across all testing samples of publications written by an author, with an average loss of 0.21.
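The unsupervised pipeline summarized above (stylometric features, k-means clustering, entropy per author) can be illustrated with a minimal sketch. It is not the authors' implementation: the three lexical features, the Euclidean distance, the seeding of centroids with the first k points, and the use of raw Shannon entropy in bits are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def lexical_features(text):
    """Return an illustrative 3-element stylometric vector for one document.

    Assumed features (not taken from the paper): average word length,
    type-token ratio, and average sentence length in words.
    """
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    avg_word_len = sum(len(w) for w in words) / len(words)
    type_token_ratio = len(set(words)) / len(words)
    avg_sentence_len = len(words) / len(sentences)
    return [avg_word_len, type_token_ratio, avg_sentence_len]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means with Euclidean distance.

    Assumption: the first k points seed the centroids, which keeps the
    sketch deterministic; real use would want a better initialization.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def author_entropy(labels, authors):
    """Shannon entropy (bits) of each author's papers across clusters.

    0 means every paper by that author landed in a single cluster.
    """
    result = {}
    for author in set(authors):
        counts = Counter(lab for lab, a in zip(labels, authors) if a == author)
        total = sum(counts.values())
        result[author] = -sum(
            (n / total) * math.log2(n / total) for n in counts.values()
        )
    return result
```

Under these assumptions, an author whose papers all fall in one cluster gets an entropy of 0, and an author whose papers are split evenly over two clusters gets an entropy of 1 bit. The threshold of 0.2 reported in the abstract refers to the paper's own entropy coefficient, whose exact definition may differ from this sketch.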



Author information

Corresponding author

Correspondence to Saeed-Ul Hassan.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Hassan, SU., Imran, M., Iftikhar, T., Safder, I., Shabbir, M. (2017). Deep Stylometry and Lexical & Syntactic Features Based Author Attribution on PLoS Digital Repository. In: Choemprayong, S., Crestani, F., Cunningham, S. (eds) Digital Libraries: Data, Information, and Knowledge for Digital Lives. ICADL 2017. Lecture Notes in Computer Science, vol 10647. Springer, Cham. https://doi.org/10.1007/978-3-319-70232-2_10

  • DOI: https://doi.org/10.1007/978-3-319-70232-2_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-70231-5

  • Online ISBN: 978-3-319-70232-2

  • eBook Packages: Computer Science, Computer Science (R0)
