Fast Plagiarism Detection Using Approximate String Matching and Vector Representation of Words
Plagiarism detection over very large document collections requires efficient methods. This paper proposes a plagiarism detection algorithm based on approximate string matching and vector representations of words, together with a speed improvement to an implementation of the algorithm. The effect of the improvement is evaluated experimentally on a dataset. The experimental results show a tradeoff between the processing time and the accuracy of the plagiarism detection algorithm, which enables its implementation to be configured in accordance with a given data space and a required accuracy.
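To make the combination of approximate string matching and word vectors concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes documents are tokenized into word lists, slides a query over a text, and scores each alignment by the average cosine similarity of aligned word vectors. The embedding table `TOY_VECTORS` and the functions `word_similarity`, `alignment_scores`, and `best_match` are hypothetical names introduced only for illustration. The naive loop takes time proportional to the product of the two lengths, and the speed improvement mentioned in the abstract is not reproduced here.

```python
from math import sqrt

# Hypothetical toy embeddings (assumption): a real system would load
# trained word vectors instead of this hand-written table.
TOY_VECTORS = {
    "fast": (0.9, 0.1, 0.0),
    "quick": (0.8, 0.2, 0.0),
    "plagiarism": (0.0, 0.9, 0.1),
    "copying": (0.1, 0.8, 0.2),
    "detection": (0.0, 0.1, 0.9),
    "method": (0.3, 0.3, 0.3),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def word_similarity(a, b):
    """Similarity of two words: 1.0 if identical, cosine if both are known, else 0.0."""
    if a == b:
        return 1.0
    if a in TOY_VECTORS and b in TOY_VECTORS:
        return cosine(TOY_VECTORS[a], TOY_VECTORS[b])
    return 0.0


def alignment_scores(pattern, text):
    """Average word similarity of `pattern` against every window of `text`."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    return [
        sum(word_similarity(pattern[j], text[i + j]) for j in range(m)) / m
        for i in range(n - m + 1)
    ]


def best_match(pattern, text):
    """Return (offset, score) of the highest-scoring window, or None if no window fits."""
    scores = alignment_scores(pattern, text)
    if not scores:
        return None
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    query = ["quick", "copying", "detection"]
    document = ["a", "fast", "plagiarism", "detection", "method"]
    print(best_match(query, document))
```

In this toy example the paraphrased query shares only one word with the document, yet the window starting at "fast" scores highest because the remaining words have nearby vectors. A threshold on the score (an assumption, not taken from the paper) would turn the scores into a detection decision.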
This work was supported by JSPS KAKENHI Grant Number 15K00310.