A New Approach for Cross-Language Plagiarism Analysis

  • Conference paper
Multilingual and Multimodal Information Access Evaluation (CLEF 2010)

Abstract

This paper presents a new method for Cross-Language Plagiarism Analysis. The task is to detect the plagiarized passages in suspicious documents and their corresponding fragments in the source documents. We propose a plagiarism detection method composed of five main phases: language normalization, retrieval of candidate documents, classifier training, plagiarism analysis, and post-processing. To evaluate the method, we created a corpus containing artificial plagiarism offenses. Two experiments were conducted: the first considers only monolingual plagiarism cases, while the second considers only cross-language plagiarism cases. The results showed that the cross-language experiment achieved 86% of the performance of the monolingual baseline. We also analyzed how the length of the plagiarized text affects the overall performance of the method. This analysis showed that the method achieved better results with medium and large plagiarized passages.
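
To make the five-phase structure concrete, the following is a minimal Python sketch of how such a pipeline could be wired together. It is not the authors' implementation: the abstract names the phases but gives no details, so every function name, the fixed five-word passage unit, the Jaccard word overlap, the midpoint threshold standing in for a trained classifier, and the merging rule in post-processing are assumptions made purely for illustration. The translator is an identity stub.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    suspicious_id: str
    source_id: str
    start: int  # index of the first flagged suspicious passage
    end: int    # index of the last flagged suspicious passage (inclusive)


def normalize_language(text, translate=lambda t: t):
    """Phase 1 (assumed): map text to one common language; translator is a stub."""
    return translate(text).lower()


def passages(text, size=5):
    """Hypothetical passage unit: consecutive windows of `size` words."""
    words = text.split()
    return [set(words[i:i + size]) for i in range(0, len(words), size)]


def overlap(a, b):
    """Jaccard similarity between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_candidates(passage, sources, top_k=2):
    """Phase 2 (assumed): rank source documents by their best passage overlap."""
    scored = [
        (max((overlap(passage, p) for p in src_passages), default=0.0), sid)
        for sid, src_passages in sources.items()
    ]
    scored.sort(reverse=True)
    return scored[:top_k]


def train_classifier(labelled_scores):
    """Phase 3 (stand-in): threshold at the midpoint between class means."""
    pos = [s for s, is_plagiarism in labelled_scores if is_plagiarism]
    neg = [s for s, is_plagiarism in labelled_scores if not is_plagiarism]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def analyse(susp_id, susp_text, sources, threshold):
    """Phase 4 (assumed): flag passages whose candidate score reaches the threshold."""
    hits = []
    for i, passage in enumerate(passages(susp_text)):
        for score, sid in retrieve_candidates(passage, sources):
            if score >= threshold:
                hits.append(Detection(susp_id, sid, i, i))
    return hits


def post_process(hits):
    """Phase 5 (assumed): merge adjacent detections that share a source document."""
    merged = []
    for h in sorted(hits, key=lambda d: (d.source_id, d.start)):
        last = merged[-1] if merged else None
        if last and last.source_id == h.source_id and h.start <= last.end + 1:
            last.end = max(last.end, h.end)
        else:
            merged.append(h)
    return merged


# Toy data, invented for this sketch only.
sources_raw = {
    "src1": "the quick brown fox jumps over the lazy dog near the river bank today",
    "src2": "completely unrelated words about cooking pasta with tomato sauce and basil leaves",
}
sources = {sid: passages(normalize_language(t)) for sid, t in sources_raw.items()}
threshold = train_classifier([(0.9, True), (0.8, True), (0.1, False), (0.2, False)])
suspicious = "The quick brown fox jumps over the lazy dog near the river bank today and more"
result = post_process(analyse("susp1", normalize_language(suspicious), sources, threshold))
print(result)
```

In this toy run the three consecutive suspicious passages that match `src1` are merged into a single detection, which is the kind of consolidation a post-processing phase might perform. A real system would replace the identity translator with a machine translation step and the threshold with a trained classifier.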



Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Corezola Pereira, R., Moreira, V.P., Galante, R. (2010). A New Approach for Cross-Language Plagiarism Analysis. In: Agosti, M., Ferro, N., Peters, C., de Rijke, M., Smeaton, A. (eds) Multilingual and Multimodal Information Access Evaluation. CLEF 2010. Lecture Notes in Computer Science, vol 6360. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15998-5_4

  • DOI: https://doi.org/10.1007/978-3-642-15998-5_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15997-8

  • Online ISBN: 978-3-642-15998-5

  • eBook Packages: Computer Science, Computer Science (R0)
