Abstract
We present an approach for finding the most probable English source document for a given suspicious Hindi document. Rather than using a complex machine translation step for language normalization, the approach relies on standard cross-language resources available for the Hindi-English pair and computes similarity with the Okapi BM25 model. We also present further improvements made to the system after analysis and discuss the challenges involved. The system was developed as part of the CLiTR competition and uses the CLiTR-Dataset for experimentation. The approach achieves a recall of 0.90, the highest reported on the dataset, and an F-measure of 0.79, the second highest.
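To make the retrieval step concrete, the following is a minimal sketch of Okapi BM25 ranking over English candidate documents, not the authors' implementation. It assumes the Hindi terms have already been mapped to English through some bilingual resource; the toy dictionary, the function names, the parameter values `k1=1.2` and `b=0.75`, and the non-negative IDF variant are all illustrative assumptions that the abstract does not specify.

```python
import math
from collections import Counter


def translate_terms(hindi_terms, dictionary):
    """Map Hindi terms to English with a bilingual dictionary (hypothetical resource).

    Terms missing from the dictionary are dropped.
    """
    english = []
    for term in hindi_terms:
        english.extend(dictionary.get(term, []))
    return english


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with Okapi BM25.

    k1 and b are common default values, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant; the "+ 1.0" inside the log is an assumption.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def most_probable_source(query_terms, docs):
    """Return the index of the highest-scoring candidate source document."""
    scores = bm25_scores(query_terms, docs)
    return max(range(len(docs)), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_dictionary = {"बाजार": ["market"], "समाचार": ["news"]}
    english_docs = [
        ["the", "cat", "sat"],
        ["stock", "market", "news"],
        ["cat", "and", "dog", "play"],
    ]
    query = translate_terms(["बाजार", "समाचार"], toy_dictionary)
    print(most_probable_source(query, english_docs))
```

In this sketch the suspicious Hindi document acts as the query and each English document is a retrieval candidate, so the top-ranked candidate is taken as the most probable source.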
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gupta, P., Singhal, K. (2013). Mapping Hindi-English Text Re-use Document Pairs. In: Majumder, P., Mitra, M., Bhattacharyya, P., Subramaniam, L.V., Contractor, D., Rosso, P. (eds) Multilingual Information Access in South Asian Languages. Lecture Notes in Computer Science, vol 7536. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40087-2_8
DOI: https://doi.org/10.1007/978-3-642-40087-2_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40086-5
Online ISBN: 978-3-642-40087-2
eBook Packages: Computer Science, Computer Science (R0)