Fast Plagiarism Detection Using Approximate String Matching and Vector Representation of Words

  • Kensuke Baba
Part of the International Series on Computer Entertainment and Media Technology book series (ISCEMT)


Plagiarism detection over a huge amount of document data requires efficient methods. This paper proposes a plagiarism detection algorithm based on approximate string matching and vector representation of words, together with a speed improvement to an implementation of the algorithm. The effect of the improvement is evaluated by experiments on a dataset. The experimental results show a tradeoff between the processing time and the accuracy of the plagiarism detection algorithm, which enables us to configure its implementation in accordance with a given data space and a required accuracy.
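To make the combination of approximate string matching and word vectors concrete, the following is a minimal sketch and not the algorithm of the paper, whose details the abstract does not give. It assumes documents are sequences of words, that each word has a unit-length vector, and that similarity at each alignment is the average inner product of aligned word vectors. The function name `sliding_scores` and the toy vectors are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def sliding_scores(
    pattern: List[str], text: List[str], vectors: Dict[str, Vector]
) -> List[float]:
    """Score every alignment of `pattern` against `text`.

    The score at offset i is the mean inner product between the vectors of
    pattern[j] and text[i + j]. With unit vectors this is the mean cosine
    similarity, so similar words contribute even when they are not identical.
    This naive version takes O(len(pattern) * len(text)) inner products.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    scores = []
    for i in range(n - m + 1):
        total = sum(
            dot(vectors[pattern[j]], vectors[text[i + j]]) for j in range(m)
        )
        scores.append(total / m)
    return scores


# Hypothetical 2-dimensional unit vectors; real word vectors would come from
# a trained embedding model and have many more dimensions.
TOY_VECTORS = {
    "fast": (1.0, 0.0),
    "quick": (1.0, 0.0),
    "slow": (-1.0, 0.0),
    "search": (0.0, 1.0),
    "method": (0.0, -1.0),
}

text = ["slow", "method", "quick", "search"]
pattern = ["fast", "search"]
scores = sliding_scores(pattern, text, TOY_VECTORS)
best_offset = max(range(len(scores)), key=scores.__getitem__)
print(scores, best_offset)
```

In this toy setting the best alignment is the one where "fast search" lines up with "quick search", even though only one word matches exactly. An alignment whose score exceeds a chosen threshold would be flagged as a candidate for plagiarism; the threshold and the vector dimension are tuning choices assumed here, not values taken from the paper.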



This work was supported by JSPS KAKENHI Grant Number 15K00310.



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Fujitsu Laboratories, Kawasaki, Japan
