Skip to main content
Log in

Semantic measure of plagiarism using a hierarchical graph model

  • Published:
Scientometrics Aims and scope Submit manuscript

Abstract

Traditional plagiarism detection is based primarily on methods of character matching or topic similarity. Another promising methodology remains largely unexplored: employing deep mining to establish a contextual hierarchy among themes. This paper proposes a semantic approach to measuring the extent of plagiarism, based on a hierarchical graph model. The main innovations are as follows: (1) hierarchical extraction of topic feature terms and elucidation of a corresponding graph structure; (2) graph similarity calculation based on the maximum common subgraph. This semantic-measure method goes beyond semantic detection of topics to take into account the context of topic feature terms, as well as the hierarchical structure by which those topics are related. This contextual-hierarchical perspective should, in turn, improve the accuracy of plagiarism detection. In addition, by mining the implicit relationships between hierarchical feature terms, our method can detect plagiarized documents with similar themes but using different topic words: a potential boon to plagiarism detection recall. In an experiment conducted on a dataset from Chinese paper database CNKI, the semantic-measure method indeed demonstrates accuracy and recall superior to those achieved with current state-of-the-art methods.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15

Similar content being viewed by others

References

  • Abdi, A., Idris, N., Alguliyev, R. M., & Aliguliyev, R. M. (2015). PDLK: plagiarism detection using linguistic knowledge. Expert Systems with Applications, 42(22), 8936–8946.

    Article  Google Scholar 

  • Aizawa, A. (2003). An information-theoretic perspective of Tf–IDF measures. Information Processing and Management, 39(1), 45–65.

    Article  MATH  Google Scholar 

  • Alzahrani, S. M., Salim N., Abraham, A., & Palade, V. (2011). iPlag: Intelligent plagiarism reasoner in scientific publications. In World congress on information and communication technologies (WICT), pp. 1–6.

  • Alzahrani, S. M., Salim, N., & Abraham, A. (2012). Understanding plagiarism linguistic patterns, textual features, and detection methods. IEEE Transactions on Systems, Man, and Cybernetics Part C, 42(2), 133–149.

    Article  Google Scholar 

  • Atoum, I., & Otoom, A. (2016). Efficient hybrid semantic text similarity using WordNet and a corpus. International Journal of Advanced Computer Science and Applications, 7(9), 124–130.

    Article  Google Scholar 

  • Barrón-Cedeño, A., & Rosso, P. (2009). On automatic plagiarism detection based on n-grams comparison. In European conference on information retrieval, pp. 696–700.

  • Biswas, S. K., Bordoloi, M., & Shreya, J. (2018). A graph based keyword extraction model using collective node weight. Expert Systems with Applications, 97, 51–59.

    Article  Google Scholar 

  • Chahal, P., Singh, M., & Kumar, S. (2013). An ontology based approach for finding semantic similarity between web documents. International Journal of Current Engineering and Technology, 3(5), 1925–1931.

    Google Scholar 

  • Chen, Q., Yao, L., & Yang, J. (2017). Short text classification based on LDA topic model. In International conference on audio, language and image processing (ICALIP), IEEE.

  • Chow, T. W. S., & Rahman, M. K. M. (2009). Multilayer SOM with tree-structured data for efficient document retrieval and plagiarism detection. IEEE Transactions on Neural Networks, 20(9), 1385–1402.

    Article  Google Scholar 

  • Deepika, J., Archana, V., Bagyalakshmi, V., & Preethi, P. (2011). A knowledge based approach to detection of idea plagiarism in online research publications. International Journal on Internet and Distributed Computing System, 1(2), 51–61.

    Google Scholar 

  • Eisa, T. A. E., Salim, N., & Alzahrani, S. (2015). Existing plagiarism detection techniques: A systematic mapping of the scholarly literature. Online Information Review, 39(3), 383–400.

    Article  Google Scholar 

  • Elhadi, M., & Al-Tobi, A. (2008). Use of text syntactical structures in detection of document duplicates. In 2008 Third international conference on digital information management, ICDIM, pp. 520–525.

  • Ezzikouri, H., Erritali, M., & Oukessou, M. (2017). Fuzzy-semantic similarity for automatic multilingual plagiarism detection. International Journal of Advanced Computer Science and Applications, 8(9), 86–90.

    Article  Google Scholar 

  • Ferreira, R., Lins, R. D., Freitas, F., Simske, S. J., & Riss, M. (2014). A new sentence similarity assessment measure based on a three-layer sentence representation. In Proceedings of the 2014 ACM symposium on document engineering, pp. 25–34.

  • Ferrero, J., Agnes, F., Besacier, L., et al. (2017). Using word embedding for cross-language plagiarism detection. arXiv preprint arXiv:1702.03082.

  • Franco-Salvador, M., Rosso, P., & Montes-y-Gómez, M. (2016). A systematic study of knowledge graph analysis for cross-language plagiarism detection. Information Processing and Management, 52(4), 550–570.

    Article  Google Scholar 

  • García-Romero, A., & Estrada-Lorenzo, J. M. (2014). A bibliometric analysis of plagiarism and self-plagiarism through Déjà vu. Scientometrics, 101(1), 381–396.

    Article  Google Scholar 

  • Gupta, D., Vani, K., & Singh, C. K. (2014). Using natural language processing techniques and fuzzy-semantic similarity for automatic external plagiarism detection. In IEEE 2014 international conference on advances in computing, communications and informatics (ICACCI), pp. 2694–2699.

  • Hiremath, S. A., & Otari, M. S. (2014). Plagiarism detection—different methods and their analysis. International Journal of Innovative Research in Advanced Engineering, 1(7), 41–47.

    Google Scholar 

  • Hoad, T. C., & Zobel, J. (2003). Methods for identifying versioned and plagiarized documents. Journal of the American Society for Information Science and Technology, 54(3), 203–215.

    Article  Google Scholar 

  • Islam, A., & Inkpen, D. (2008). Semantic text similarity using corpus-based word similarity and string similarity. ACM Transactions on Knowledge Discovery from Data, 2(2), 1–25.

    Article  Google Scholar 

  • Jarić, I. (2016). High time for a common plagiarism detection system. Scientometrics, 106(1), 457–459.

    Article  MathSciNet  Google Scholar 

  • Jinquan, W., Maocheng, L., & Hongliang, Y. (2007). A measure of sentence similarity based on n-grams and vector space model. Modern Foreign Languages, 4, 011.

    Google Scholar 

  • Kim, W., Jang, H., Kim, H. J., et al. (2016). A document query search using an extended centrality with the word2vec. In ICEC 2016International conference on electronic commerce: E-commerce in smart connected world, pp. 14:1–14:8.

  • Lau, J. H., & Baldwin T. (2016). An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint. arXiv:1607.05368.

  • Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. In Proceedings of the 31th international conference on machine learning (ICML’14), Vol. 32, Beijing, China, JMLR Proceedings, pp. 1188–1196.

  • Li, M. (2018). Classifying and ranking topic terms based on a novel approach: role differentiation of author keywords. Scientometrics, 116(1), 1–24.

    Article  Google Scholar 

  • Li, S., Sun, Y., & Soergel, D. (2015). A new method for automatically constructing domain-oriented term taxonomy based on weighted word co-occurrence analysis. Scientometrics, 103(3), 1023–1042.

    Article  Google Scholar 

  • Liu, M., Lang, B., Gu, Z., et al. (2017). Measuring similarity of academic articles with semantic profile and joint word embedding. Tsinghua Science and Technology, 06, 71–84.

    Google Scholar 

  • Liu, X., Xu, C., & Ouyang, B. (2015). Plagiarism detection algorithm for source code in computer science education. International Journal of Distance Education Technologies (IJDET), 13(4), 29–39.

    Article  Google Scholar 

  • Luo, L., Ming, J., Wu, D., Liu, P., & Zhu, S. (2017). Semantics-based obfuscation-resilient binary code similarity comparison with applications to software and algorithm plagiarism detection. IEEE Transactions on Software Engineering, 43(12), 1157–1177.

    Article  Google Scholar 

  • Mariani, J., Francopoulo, G., & Paroubek, P. (2018). Reuse and plagiarism in speech and natural language processing publications. International Journal on Digital Libraries, 19(2–3), 113–126.

    Article  Google Scholar 

  • Menai, M. E. B. (2012). Detection of plagiarism in Arabic documents. International Journal of Information Technology and Computer Science, 10, 80–89.

    Article  Google Scholar 

  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint. arXiv:1301.3781.

  • Momtaz, M., Bijari, K., Salehi, M., & Veisi, H. (2016). Graph-based approach to text alignment for plagiarism detection in Persian documents. In FIRE, pp. 176–179. http://ceur-ws.org/Vol-1737/T4-9.pdf. Accessed 30 Sep 2018.

  • Niraula, N., Banjade, R., Ştefănescu, D., et al. (2013). Experiments with semantic similarity measures based on LDA and LSA. In International conference on statistical language and speech processing, Springer, Berlin.

  • Osman, A. H., & Barukab, O. M. (2017). SVM significant role selection method for improving semantic text plagiarism detection. International Journal of Advanced and Applied Sciences, 4(8), 112–122.

    Article  Google Scholar 

  • Osman, A. H., Salim, N., Binwahlan, S., Hentabli, H., & Ali, A. M. (2011). Conceptual similarity and graph-based method for plagiarism detection. Journal of Theoretical and Applied Information Technology, 32(2), 135–145.

    Google Scholar 

  • Osman, A. H., Salim, N., Binwwahlan, M. S., Alteeb, R., & Abuobieda, A. (2012). An improved plagiarism detection scheme based on semantic role labeling. Applied Soft Computing, 12(5), 1493–1502.

    Article  Google Scholar 

  • Rahim, R., Kurniasih, N., Irawan, M. D., Siregar, Y. H., Hasibuan, A., Sari, D. A. P., et al. (2018). Latent semantic indexing for Indonesian text similarity. International Journal of Engineering & Technology, 7(23), 73–77.

    Article  Google Scholar 

  • Ramachandran, L., & Gehringer, E. F. (2011). Determining degree of relevance of reviews using a graph-based text representation. In IEEE 23rd international conference on tools with artificial intelligence, pp. 442–445.

  • Rehurek, R. (2008). Plagiarism detection through vector space models applied to a digital library. In Proceedings of the second workshop on recent advances in slavonic natural languages, pp. 75–83.

  • Rexha, A., Kröll, M., Ziak, H., & Kern, R. (2018). Authorship identification of documents with high content similarity. Scientometrics, 115(1), 223–237.

    Article  Google Scholar 

  • Sánchez, D., Batet, M., Isern, D., & Valls, A. (2012). Ontology-based semantic similarity: A new feature-based approach. Expert Systems with Applications, 39(9), 7718–7728.

    Article  Google Scholar 

  • Schuhmacher, M., & Ponzetto, S. P. (2014). Knowledge-based graph document modeling. In Proceedings of the 7th ACM international conference on Web search and data mining, ACM, pp. 543–552.

  • Silva, F. B., Werneck, R. D. O., Goldenstein, S., Tabbone, S., & Torres, R. D. S. (2018). Graph-based bag-of-words for classification. Pattern Recognition, 74, 266–285.

    Article  Google Scholar 

  • Sonawane, S. S., & Kulkarni, P. A. (2014). Graph based representation and analysis of text document: A survey of techniques. International Journal of Computer Applications, 96(19), 1–8.

    Article  Google Scholar 

  • Tan, C.-M., Wang, Y.-F., & Lee, C.-D. (2002). The use of bigrams to enhance text categorization. Information Processing and Management, 38(4), 529–546.

    Article  MATH  Google Scholar 

  • Tang, W., Du, Z. O. U., & Zhang, L. (2017). A plagiarism detection method based on learning behavior analysis. In DEStech transactions on social science, education and human science, international conference on education reform and modern management (ERMM), pp. 43–47.

  • Tien, N. M., & Labbé, C. (2018). Detecting automatically generated sentences with grammatical structure similarity. Scientometrics, 116(2), 1247–1271.

    Article  Google Scholar 

  • Vani, K., & Gupta, D. (2015). Investigating the impact of combined similarity metrics and POS tagging in extrinsic text plagiarism detection system. In International conference on advances in computing, communications and informatics (ICACCI), pp. 1578–1584.

  • Vani, K., & Gupta, D. (2017). Detection of idea plagiarism using syntax–semantic concept extractions with genetic algorithm. Expert Systems with Applications, 73, 11–26.

    Article  Google Scholar 

  • Vani, K., & Gupta, D. (2018a). Unmasking text plagiarism using syntactic-semantic based natural language processing techniques: Comparisons, analysis and challenges. Information Processing and Management, 54(3), 408–432.

    Article  Google Scholar 

  • Vani, K., & Gupta, D. (2018b). Integrating syntax-semantic-based text analysis with structural and citation information for scientific plagiarism detection. Journal of the Association for Information Science and Technology, 69(11), 1330–1345.

    Article  Google Scholar 

  • Wu, J., Xuan, Z., & Pan, D. (2011). Enhancing text representation for classification tasks with semantic graph structures. International Journal of Innovative Computing, Information, & Control, 7(5), 2689–2698.

    Google Scholar 

  • Zhang, C., Chen, L., & Li, Q. (2016). A Chinese text similarity calculation algorithm based on DF_LDA. In Proceedings of the 6th international asia conference on industrial engineering and management innovation, Atlantis Press.

  • Zhang, H., & Chow, T. W. S. (2011). A coarse-to-fine framework to efficiently thwart plagiarism. Pattern Recognition, 44(2), 471–487.

    Article  Google Scholar 

  • Zhang, W., Yoshida, T., & Tang, X. (2011). A comparative study of TF*IDF, LSI and multi-words for text classification. Expert Systems with Applications, 38(3), 2758–2765.

    Article  Google Scholar 

  • Zhang, Y., Jin, R., & Zhou, Z. H. (2010). Understanding bag-of-words model: A statistical framework. International Journal of Machine Learning and Cybernetics, 1(1–4), 43–52.

    Article  Google Scholar 

Download references

Acknowledgements

The author acknowledges the support by the Project No. 71673122 funded by National Natural Science Foundation of China and Nanjing University Innovation and Creative Program for PhD candidate CXCY17-09.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Qinghua Zhu.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Zhang, T., Lee, B. & Zhu, Q. Semantic measure of plagiarism using a hierarchical graph model. Scientometrics 121, 209–239 (2019). https://doi.org/10.1007/s11192-019-03204-x

Download citation

  • Received:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11192-019-03204-x

Keywords

Mathematics Subject Classification

Navigation