Volume 121, Issue 1, pp 209–239

Semantic measure of plagiarism using a hierarchical graph model

  • Tingting Zhang
  • Baozhen Lee
  • Qinghua Zhu


Traditional plagiarism detection is based primarily on methods of character matching or topic similarity. Another promising methodology remains largely unexplored: employing deep mining to establish a contextual hierarchy among themes. This paper proposes a semantic approach to measuring the extent of plagiarism, based on a hierarchical graph model. The main innovations are as follows: (1) hierarchical extraction of topic feature terms and elucidation of a corresponding graph structure; and (2) graph similarity calculation based on the maximum common subgraph. This semantic-measure method goes beyond semantic detection of topics to take into account the context of topic feature terms, as well as the hierarchical structure by which those topics are related. This contextual-hierarchical perspective should, in turn, improve the accuracy of plagiarism detection. In addition, by mining the implicit relationships between hierarchical feature terms, our method can detect plagiarized documents that share similar themes but use different topic words, a potential boon to plagiarism detection recall. In an experiment conducted on a dataset from the Chinese paper database CNKI, the semantic-measure method indeed demonstrates accuracy and recall superior to those achieved with current state-of-the-art methods.
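To make the idea of graph similarity based on the maximum common subgraph more concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give their formula, so the sketch assumes a commonly used normalization (size of the common subgraph divided by the size of the larger graph) and assumes that every node is a unique topic feature term, so that the maximum common subgraph reduces to the shared nodes and shared edges. The names `common_subgraph`, `mcs_similarity`, and the two toy graphs are hypothetical.

```python
# Illustrative sketch only; not the method as implemented in the paper.
# Assumption: a document is a graph whose nodes are unique topic feature terms
# and whose directed edges are parent -> child links in the term hierarchy.


def common_subgraph(g1, g2):
    """Return the (nodes, edges) shared by two term graphs.

    Each graph is a dict with a set of 'nodes' and a set of 'edges' (pairs).
    With unique node labels, the maximum common subgraph is simply the shared
    nodes plus the shared edges that connect them.
    """
    nodes = g1["nodes"] & g2["nodes"]
    edges = {
        (a, b)
        for (a, b) in g1["edges"] & g2["edges"]
        if a in nodes and b in nodes  # defensive: keep edges between shared nodes
    }
    return nodes, edges


def graph_size(nodes, edges):
    """Size of a graph, counted here as nodes plus edges (an assumption)."""
    return len(nodes) + len(edges)


def mcs_similarity(g1, g2):
    """Common-subgraph size divided by the size of the larger graph."""
    nodes, edges = common_subgraph(g1, g2)
    denom = max(
        graph_size(g1["nodes"], g1["edges"]),
        graph_size(g2["nodes"], g2["edges"]),
    )
    return graph_size(nodes, edges) / denom if denom else 0.0


# Hypothetical toy graphs for a suspect document and a source document.
suspect = {
    "nodes": {"plagiarism", "graph", "semantics", "topic"},
    "edges": {("plagiarism", "graph"), ("plagiarism", "semantics"), ("graph", "topic")},
}
source = {
    "nodes": {"plagiarism", "graph", "semantics", "corpus"},
    "edges": {("plagiarism", "graph"), ("plagiarism", "semantics"), ("semantics", "corpus")},
}

score = mcs_similarity(suspect, source)
print(round(score, 3))
```

In this toy case the two graphs share three terms and two hierarchical links out of seven elements each. A real system would also need the term extraction and hierarchy construction steps, which the sketch leaves out.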


Keywords: Plagiarism detection; Semantic measure; Graph model; Hierarchical structure

Mathematics Subject Classification

68T30 68T50 90B10 



The author acknowledges support from Project No. 71673122, funded by the National Natural Science Foundation of China, and from the Nanjing University Innovation and Creative Program for PhD Candidates (CXCY17-09).



Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2019

Authors and Affiliations

  1. School of Engineering Management, Nanjing University, Nanjing, China
  2. School of Information Engineering, Nanjing Audit University, Nanjing, China
