Fast Plagiarism Detection Using Approximate String Matching and Vector Representation of Words

Abstract

Detecting plagiarism across a huge volume of documents requires efficient methods. This paper proposes a plagiarism detection algorithm based on approximate string matching and vector representations of words, together with a speed improvement to an implementation of the algorithm. The effect of the improvement is evaluated experimentally on a dataset. The results show a tradeoff between the processing time and the accuracy of the plagiarism detection algorithm, which enables us to configure its implementation in accordance with a given data space and a required accuracy.

Acknowledgements

This work was supported by JSPS KAKENHI Grant Number 15K00310.

Author information

Corresponding author

Correspondence to Kensuke Baba.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Baba, K. (2018). Fast Plagiarism Detection Using Approximate String Matching and Vector Representation of Words. In: Wong, R., Chi, CH., Hung, P. (eds) Behavior Engineering and Applications. International Series on Computer Entertainment and Media Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-76430-6_3

  • DOI: https://doi.org/10.1007/978-3-319-76430-6_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-76429-0

  • Online ISBN: 978-3-319-76430-6

  • eBook Packages: Computer Science (R0)