Abstract
In this paper, we address the problem of author attribution through unsupervised clustering on lexical and syntactic features and through a novel deep learning based stylometric model. For this purpose, we download all 158,918 publications available up to 1 July 2015 from PLOS.org, an open access digital repository of full text publications. After pre-processing, we retain 803 single-authored publications written by 203 unique authors. For unsupervised modeling, stylometric markers such as lexical and syntactic features are used to build a distance matrix for the k-Means clustering algorithm. For supervised modeling, we present a novel long short-term memory (LSTM) based deep learning model that attributes a given publication to its author. Our unsupervised model assigns 88.17% of authors to the correct cluster (one containing all papers written by the same author), with an entropy error coefficient of at most 0.2. Our deep learning based model consistently achieves above 95% accuracy across all testing samples of publications written by an author, with an average loss of 0.21.
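The unsupervised evaluation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the three lexical features, the entropy normalisation, and the function names `lexical_features`, `author_entropy`, and `fraction_within` are all assumptions chosen for illustration, since the abstract specifies neither the exact feature set nor the entropy formula. The cluster labels would come from a clustering step such as k-Means, which is omitted here.

```python
import math
import re
from collections import Counter, defaultdict


def lexical_features(text):
    """Return (avg word length, avg sentence length in words, type-token ratio).

    These three markers are illustrative assumptions, not the paper's feature set.
    """
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return (0.0, 0.0, 0.0)
    avg_word_len = sum(len(w) for w in words) / len(words)
    avg_sent_len = len(words) / len(sentences)
    type_token_ratio = len(set(words)) / len(words)
    return (avg_word_len, avg_sent_len, type_token_ratio)


def author_entropy(cluster_labels):
    """Normalised Shannon entropy of one author's cluster labels.

    0.0 means every paper by the author landed in a single cluster;
    1.0 means every paper landed in a different cluster.
    The normalisation by log2(n) is an assumption.
    """
    counts = Counter(cluster_labels)
    n = len(cluster_labels)
    if len(counts) <= 1:
        return 0.0
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(n)


def fraction_within(assignments, threshold=0.2):
    """Fraction of authors whose entropy is at most `threshold`.

    `assignments` is an iterable of (author, cluster_label) pairs,
    one pair per paper.
    """
    by_author = defaultdict(list)
    for author, label in assignments:
        by_author[author].append(label)
    if not by_author:
        return 0.0
    ok = sum(1 for labels in by_author.values()
             if author_entropy(labels) <= threshold)
    return ok / len(by_author)
```

Under these assumptions, calling `fraction_within` on the (author, cluster) pairs produced by a clustering run gives a figure comparable in spirit to the reported share of authors within the 0.2 entropy bound, although the paper's own definition may differ.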
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Hassan, SU., Imran, M., Iftikhar, T., Safder, I., Shabbir, M. (2017). Deep Stylometry and Lexical & Syntactic Features Based Author Attribution on PLoS Digital Repository. In: Choemprayong, S., Crestani, F., Cunningham, S. (eds) Digital Libraries: Data, Information, and Knowledge for Digital Lives. ICADL 2017. Lecture Notes in Computer Science, vol 10647. Springer, Cham. https://doi.org/10.1007/978-3-319-70232-2_10
DOI: https://doi.org/10.1007/978-3-319-70232-2_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-70231-5
Online ISBN: 978-3-319-70232-2
eBook Packages: Computer Science (R0)