Abstract
Plagiarism detection over very large collections of documents requires efficient methods. This paper proposes a plagiarism detection algorithm based on approximate string matching and vector representations of words, together with a speed improvement to an implementation of the algorithm. The effect of the improvement is evaluated experimentally on a dataset. The results show a tradeoff between the processing time and the accuracy of the plagiarism detection algorithm, which enables us to configure its implementation in accordance with a given data space and a required accuracy.
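The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of combining approximate string matching with word vectors, not the chapter's method. It slides a query over a document word by word and, at each offset, sums a per-word similarity (1 for identical words, otherwise the clipped cosine similarity of their vectors). The function names, the toy two-dimensional vectors, and the clipping rule are all assumptions made for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_similarity(a, b, vectors):
    """Similarity of two words in [0, 1]; the scoring rule is an assumption."""
    if a == b:
        return 1.0
    if a in vectors and b in vectors:
        # Negative cosines are clipped so dissimilar words contribute nothing.
        return max(0.0, cosine(vectors[a], vectors[b]))
    return 0.0


def alignment_scores(text, pattern, vectors):
    """Score every offset of `pattern` in `text` by summed word similarity.

    This is the naive O(len(text) * len(pattern)) computation, shown only
    to make the scoring concrete.
    """
    n, m = len(text), len(pattern)
    return [
        sum(word_similarity(text[i + j], pattern[j], vectors) for j in range(m))
        for i in range(n - m + 1)
    ]


# Hypothetical toy vectors; real word vectors would come from a trained model.
vectors = {"quick": [0.9, 0.1], "fast": [0.8, 0.2]}
text = "the quick brown fox jumps".split()
pattern = "fast brown fox".split()

scores = alignment_scores(text, pattern, vectors)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores)
```

In this toy run the paraphrased query aligns best at offset 1, because "fast" and "quick" have nearly parallel vectors while the other two words match exactly. Exact word matching alone would score that offset 2 rather than roughly 3.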
Acknowledgements
This work was supported by JSPS KAKENHI Grant Number 15K00310.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Baba, K. (2018). Fast Plagiarism Detection Using Approximate String Matching and Vector Representation of Words. In: Wong, R., Chi, CH., Hung, P. (eds) Behavior Engineering and Applications. International Series on Computer Entertainment and Media Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-76430-6_3
DOI: https://doi.org/10.1007/978-3-319-76430-6_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-76429-0
Online ISBN: 978-3-319-76430-6
eBook Packages: Computer Science, Computer Science (R0)