Filtering Documents for Plagiarism Detection

  • Kensuke Baba
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11198)


Plagiarism detection requires efficient methods. This paper proposes a fast and scalable method for detecting “copy and paste”-type plagiarism in documents. Detecting this type of plagiarism requires comprehensive matching of ordered word occurrences, which demands either a long processing time or a large database. The author improved the scalability of an existing fast method based on the fast Fourier transform by applying the idea of frequency-domain filtering. He evaluated the effect of this improvement on the accuracy of the plagiarism detection method and achieved an effective trade-off between accuracy and the required database size.
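To make "matching of ordered word occurrences" with the fast Fourier transform concrete, here is a minimal sketch of the classic FFT-based match-count computation. It illustrates the general technique only, not the author's implementation, and it does not include the frequency-domain filtering step that the paper adds. The function names `fft`, `cross_correlate`, and `match_counts` are hypothetical, and the one-indicator-vector-per-word encoding is an assumption made for this example.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two.

    With invert=True the result is the unnormalized inverse transform.
    """
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def cross_correlate(x, y):
    """Return c[i] = sum_j x[i + j] * y[j] for i = 0 .. len(x) - len(y)."""
    n, m = len(x), len(y)
    size = 1
    while size < n + m:
        size *= 2
    fx = fft([complex(v) for v in x] + [0j] * (size - n))
    # Reversing y turns the correlation into an ordinary convolution.
    fy = fft([complex(v) for v in reversed(y)] + [0j] * (size - m))
    conv = fft([p * q for p, q in zip(fx, fy)], invert=True)
    # Alignment i of the correlation sits at index i + m - 1 of the convolution.
    return [conv[i + m - 1].real / size for i in range(n - m + 1)]


def match_counts(text, pattern):
    """For each alignment i, count positions j with text[i + j] == pattern[j].

    Assumes len(pattern) <= len(text). Each distinct symbol of the pattern
    is encoded as a 0/1 indicator vector (an assumption of this sketch).
    """
    n, m = len(text), len(pattern)
    counts = [0.0] * (n - m + 1)
    for symbol in set(pattern):
        tx = [1.0 if t == symbol else 0.0 for t in text]
        px = [1.0 if p == symbol else 0.0 for p in pattern]
        for i, c in enumerate(cross_correlate(tx, px)):
            counts[i] += c
    # Rounding removes floating-point noise from the transforms.
    return [round(c) for c in counts]
```

Both arguments can be strings or lists of words, for example `match_counts(document.split(), query.split())`, where `document` and `query` are placeholder names. An alignment whose count is close to the pattern length indicates a run of words copied in the same order. Using one indicator vector per distinct word keeps the sketch simple but scales with vocabulary size.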


Keywords: Plagiarism detection; Text processing; Vector representation of words; Fast Fourier transform; Filtering



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Fujitsu Laboratories, Kawasaki, Japan
