Abstract
This paper presents a new method for Cross-Language Plagiarism Analysis. The task is to detect plagiarized passages in suspicious documents and to identify the corresponding fragments in the source documents. We propose a plagiarism detection method composed of five main phases: language normalization, retrieval of candidate documents, classifier training, plagiarism analysis, and post-processing. To evaluate the method, we created a corpus containing artificial plagiarism offenses. We conducted two experiments: the first considers only monolingual plagiarism cases, while the second considers only cross-language plagiarism cases. The cross-language experiment achieved 86% of the performance of the monolingual baseline. We also analyzed how the length of the plagiarized text affects the overall performance of the method. This analysis showed that our method achieves better results on medium and large plagiarized passages than on small ones.
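The five phases named above can be pictured with a toy sketch. It is not the authors' implementation: the function names, the fixed passage size, the word-overlap score, and the fixed threshold that stands in for a trained classifier are all assumptions made for illustration, and the `translate` argument is a hypothetical placeholder for whatever translation step performs the language normalization.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


def normalize_language(text: str, translate: Callable[[str], str]) -> str:
    """Phase 1 (assumed form): map every text into one common language."""
    return translate(text).lower()


def split_passages(text: str, size: int = 8) -> List[str]:
    """Cut a text into fixed-size word windows (illustrative choice)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def overlap(passage: str, source: str) -> float:
    """Share of the passage's words that also occur in the source."""
    p, s = Counter(passage.split()), Counter(source.split())
    total = sum(p.values())
    return sum((p & s).values()) / total if total else 0.0


def retrieve_candidates(passage: str, sources: Dict[str, str], top_k: int = 2) -> List[str]:
    """Phase 2 (assumed form): rank sources by overlap and keep the best few."""
    ranked = sorted(sources, key=lambda name: overlap(passage, sources[name]), reverse=True)
    return ranked[:top_k]


def detect(suspicious: str, sources: Dict[str, str],
           translate: Callable[[str], str], threshold: float = 0.8) -> List[Tuple[int, int, str]]:
    """Return (first_passage, last_passage, source_name) spans flagged as plagiarized."""
    suspicious = normalize_language(suspicious, translate)
    sources = {name: normalize_language(text, translate) for name, text in sources.items()}

    hits: List[Tuple[int, str]] = []
    for idx, passage in enumerate(split_passages(suspicious)):
        for name in retrieve_candidates(passage, sources):
            # Phases 3 and 4: a fixed threshold stands in for the trained
            # classifier here; this is an assumption, not the paper's model.
            if overlap(passage, sources[name]) >= threshold:
                hits.append((idx, name))
                break  # keep only the best-ranked source per passage

    # Phase 5 (assumed form): merge adjacent flagged passages from one source.
    merged: List[Tuple[int, int, str]] = []
    for idx, name in hits:
        if merged and merged[-1][2] == name and merged[-1][1] == idx - 1:
            merged[-1] = (merged[-1][0], idx, name)
        else:
            merged.append((idx, idx, name))
    return merged
```

Calling `detect(suspicious_text, {"doc1": source_text}, translate=lambda t: t)` runs the sketch in a monolingual setting; passing a real translation function would be the cross-language case.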
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Corezola Pereira, R., Moreira, V.P., Galante, R. (2010). A New Approach for Cross-Language Plagiarism Analysis. In: Agosti, M., Ferro, N., Peters, C., de Rijke, M., Smeaton, A. (eds) Multilingual and Multimodal Information Access Evaluation. CLEF 2010. Lecture Notes in Computer Science, vol 6360. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15998-5_4
DOI: https://doi.org/10.1007/978-3-642-15998-5_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15997-8
Online ISBN: 978-3-642-15998-5
eBook Packages: Computer Science, Computer Science (R0)