Fast Plagiarism Detection Using Approximate String Matching and Vector Representation of Words

Baba, Kensuke

doi:10.1007/978-3-319-76430-6_3

Chapter
First Online: 11 July 2018

Part of the book series: International Series on Computer Entertainment and Media Technology (ISCEMT)

Abstract

Plagiarism detection over large collections of documents requires efficient methods. This paper proposes a plagiarism detection algorithm based on approximate string matching and vector representation of words, together with a speed improvement to an implementation of the algorithm. The effect of the improvement is evaluated experimentally on a dataset. The results show a tradeoff between the processing time and the accuracy of the plagiarism detection algorithm, which enables us to configure its implementation in accordance with a given data space and a required accuracy.
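As a rough illustration of how approximate string matching can be combined with word vectors, the sketch below slides a query passage over a document and, at each alignment, sums the cosine similarities of the aligned words. It is not the chapter's algorithm or its speed improvement: the function names, the naive quadratic scan, and the hand-made toy vectors in `VECTORS` are all assumptions made for illustration.

```python
from math import sqrt

# Hypothetical toy word vectors; a real system would load trained embeddings.
VECTORS = {
    "the":    (1.0, 0.0, 0.0, 0.0),
    "cat":    (0.0, 1.0, 0.0, 0.0),
    "feline": (0.0, 0.9, 0.1, 0.0),
    "sat":    (0.0, 0.0, 1.0, 0.0),
    "dog":    (0.0, 0.6, 0.0, 0.8),
    "ran":    (0.0, 0.0, 0.6, 0.8),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def score_vector(text, pattern, vectors):
    """Score every alignment of `pattern` against `text`.

    Entry i is the sum over j of the similarity between pattern[j] and
    text[i + j]. This naive scan takes O(len(text) * len(pattern)) time.
    """
    n, m = len(text), len(pattern)
    return [
        sum(cosine(vectors[pattern[j]], vectors[text[i + j]]) for j in range(m))
        for i in range(n - m + 1)
    ]


def best_alignment(text, pattern, vectors):
    """Return (position, mean similarity per word) of the best alignment."""
    scores = score_vector(text, pattern, vectors)
    i = max(range(len(scores)), key=scores.__getitem__)
    return i, scores[i] / len(pattern)


if __name__ == "__main__":
    document = "the dog ran the cat sat".split()
    query = "the feline sat".split()
    print(best_alignment(document, query, VECTORS))
```

With exact word matching, "feline" and "cat" would count as a mismatch; scoring by vector similarity lets a paraphrased passage still align closely with its source. A threshold on the mean similarity (an assumed design choice here) would then decide whether an alignment is flagged.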

Acknowledgements

This work was supported by JSPS KAKENHI Grant Number 15K00310.

Author information

Authors and Affiliations

Fujitsu Laboratories, Kawasaki, Japan
Kensuke Baba

Corresponding author

Correspondence to Kensuke Baba.

Editor information

Editors and Affiliations

School of Computer Science and Engineering, University of New South Wales, Sydney, New South Wales, Australia
Raymond Wong
CSIRO, Hobart, Tasmania, Australia
Chi-Hung Chi
Faculty of Business and Information Technology, University of Ontario Institute of Technology, Oshawa, Ontario, Canada
Patrick C. K. Hung

About this chapter

Cite this chapter

Baba, K. (2018). Fast Plagiarism Detection Using Approximate String Matching and Vector Representation of Words. In: Wong, R., Chi, CH., Hung, P. (eds) Behavior Engineering and Applications. International Series on Computer Entertainment and Media Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-76430-6_3

DOI: https://doi.org/10.1007/978-3-319-76430-6_3
Published: 11 July 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-76429-0
Online ISBN: 978-3-319-76430-6
eBook Packages: Computer Science, Computer Science (R0)
