Scalable and language-independent embedding-based approach for plagiarism detection considering obfuscation type: no training phase

  • Erfaneh Gharavi
  • Hadi Veisi
  • Paolo Rosso
Original Article


The efficiency and scalability of plagiarism detection systems have become a major challenge because of the vast amount of textual data available on the Internet in many languages. Plagiarism occurs at different levels of obfuscation, ranging from exact copies of the original material to text summarization. Consequently, algorithms designed to detect plagiarism should be robust to diverse languages and to different types of obfuscation. In this paper, we employ text embedding vectors to compare the similarity of documents in order to detect plagiarism. Word vectors are combined by a simple aggregation function to represent a text document. This representation captures semantic and syntactic information of the text and leads to efficient text alignment between suspicious and original documents. By comparing the representations of sentences in the source and suspicious documents, the sentence pairs with the highest similarity are taken as the candidates, or seeds, of plagiarism cases. To filter and merge these seeds, a set of parameters, including Jaccard similarity and a merging threshold, is tuned by two different approaches: offline tuning and online tuning. The offline method, which is used as the benchmark, sets a single set of parameters for all types of plagiarism through several trials on the training corpus. Experiments show that performance improves when the obfuscation type is considered during threshold tuning. In this regard, our proposed online approach uses two statistical methods to filter outlier candidates automatically according to their degree of obfuscation. With the online tuning approach, no separate training dataset is required to train the system. We applied our proposed method to available datasets in English, Persian and Arabic on the text alignment task in order to evaluate its robustness across languages as well.
As our experimental results confirm, our efficient approach achieves considerable performance on different datasets in various languages. Our online threshold tuning approach, which needs no training dataset, works as well as, and in some cases better than, the training-based method.
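
The seed-generation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the "simple aggregation function" is a plain average of word vectors, that sentence similarity is cosine similarity, and that a seed must pass both a cosine threshold and a Jaccard threshold. The toy three-dimensional vectors, the threshold values and all function names (`embed`, `cosine`, `jaccard`, `find_seeds`) are hypothetical and chosen only for illustration.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values, illustration only).
TOY_VECTORS = {
    "students":  [0.9, 0.1, 0.0],
    "pupils":    [0.8, 0.2, 0.0],
    "copy":      [0.1, 0.9, 0.1],
    "duplicate": [0.2, 0.7, 0.1],
    "essays":    [0.0, 0.2, 0.9],
    "rain":      [-0.5, 0.3, -0.6],
    "falls":     [-0.4, 0.5, -0.2],
}


def embed(sentence, vectors):
    """Average the vectors of known words; return None if no word is known."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard(s1, s2):
    """Jaccard similarity of the word sets of two sentences."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_seeds(suspicious, source, vectors, cos_threshold=0.9, jaccard_threshold=0.1):
    """For each suspicious sentence, keep its most similar source sentence
    as a seed if both (assumed) thresholds are met."""
    seeds = []
    for i, susp in enumerate(suspicious):
        susp_vec = embed(susp, vectors)
        if susp_vec is None:
            continue  # no known words, so nothing to compare
        best_j, best_sim = None, -1.0
        for j, src in enumerate(source):
            src_vec = embed(src, vectors)
            if src_vec is None:
                continue
            sim = cosine(susp_vec, src_vec)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if (best_j is not None and best_sim >= cos_threshold
                and jaccard(susp, source[best_j]) >= jaccard_threshold):
            seeds.append((i, best_j, best_sim))
    return seeds


source_sentences = ["students copy essays", "rain falls today"]
suspicious_sentences = ["pupils duplicate essays", "sun shines"]
seeds = find_seeds(suspicious_sentences, source_sentences, TOY_VECTORS)
print(seeds)
```

In this toy run the first suspicious sentence pairs with the first source sentence although only one word is shared, because the averaged vectors are close. The second suspicious sentence contains no known words and is skipped. Raising `jaccard_threshold` shows how a stricter filter discards heavily reworded candidates, which is the kind of trade-off that tuning thresholds per obfuscation type is meant to handle.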


Text alignment · Language-independent plagiarism detection · Word embedding · Text representation · Obfuscation type



The work of Paolo Rosso was partially funded by the Spanish MICINN under the research project MISMIS-FAKEnHATE on Misinformation and Miscommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31).

Compliance with ethical standards

Conflict of interest

The authors declare that they have no competing interests.



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  1. Data and Signal Processing Lab, Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran
  2. PRHLT Research Center, Universitat Politècnica de València, Valencia, Spain
