Plag-Inn: Intrinsic Plagiarism Detection Using Grammar Trees

  • Michael Tschuggnall
  • Günther Specht
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7337)


Intrinsic plagiarism detection is the task of finding plagiarized sections of a text document without using a reference corpus. This paper describes a novel approach to this task that processes and analyzes the grammar of a suspicious document. The main idea is to split the text into single sentences and to compute a grammar tree for each sentence. To find suspicious sentences, these grammar trees are compared pairwise using the pq-gram distance, an alternative to the tree edit distance, and the results are collected in a distance matrix. Finally, sentences whose grammar differs significantly from that of the others, with respect to a Gaussian normal distribution, are marked as suspicious.
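
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tree representation (a `(label, children)` pair), the function names, the defaults `p=2` and `q=3`, the use of the mean distance to all other sentences as a per-sentence score, and the threshold factor `k` are all assumptions made for illustration. Sentence splitting and parsing are left out, so the trees would have to come from an external parser.

```python
from collections import Counter
from statistics import mean, pstdev

# Assumption: a grammar tree is a (label, children) pair, e.g.
# ("S", [("NP", []), ("VP", [])]). Real trees would come from a parser.


def pq_profile(tree, p=2, q=3):
    """Return the bag (multiset) of pq-grams of a tree as a Counter."""
    profile = Counter()

    def visit(node, stem):
        label, children = node
        # The stem holds the current node and its p-1 nearest ancestors.
        stem = stem[1:] + (label,)
        # The base is a sliding window over q consecutive children.
        base = ("*",) * q
        if not children:
            # A leaf contributes one pq-gram with an all-dummy base.
            profile[stem + base] += 1
            return
        for child in children:
            base = base[1:] + (child[0],)
            profile[stem + base] += 1
            visit(child, stem)
        # Slide the window past the last child with q-1 dummy nodes.
        for _ in range(q - 1):
            base = base[1:] + ("*",)
            profile[stem + base] += 1

    visit(tree, ("*",) * p)
    return profile


def pq_gram_distance(t1, t2, p=2, q=3):
    """pq-gram distance: 1 - 2 * |P1 bag-intersect P2| / (|P1| + |P2|)."""
    a, b = pq_profile(t1, p, q), pq_profile(t2, p, q)
    shared = sum((a & b).values())  # Counter '&' is bag intersection
    total = sum(a.values()) + sum(b.values())
    return 1.0 - 2.0 * shared / total


def suspicious_sentences(trees, k=2.0):
    """Return indices of sentences whose mean distance to the others is
    more than k standard deviations above the average (needs >= 2 trees).
    """
    n = len(trees)
    matrix = [[pq_gram_distance(trees[i], trees[j]) for j in range(n)]
              for i in range(n)]
    # Score each sentence by its mean distance to every other sentence.
    scores = [mean(row[j] for j in range(n) if j != i)
              for i, row in enumerate(matrix)]
    mu, sigma = mean(scores), pstdev(scores)
    return [i for i, score in enumerate(scores) if score > mu + k * sigma]
```

Calling `suspicious_sentences(trees)` on a list of parsed sentences returns the positions of the sentences flagged as grammatical outliers. A smaller `k` flags more sentences, so the value would need tuning on real data.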


Keywords: intrinsic plagiarism detection, grammar trees, stylistic inconsistencies, pq-gram distance, NLP applications





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Michael Tschuggnall (1)
  • Günther Specht (1)
  1. Databases and Information Systems, Institute of Computer Science, University of Innsbruck, Austria
