Author Profiling and Plagiarism Detection

  • Paolo Rosso
Chapter
Part of the Communications in Computer and Information Science book series (CCIS, volume 505)

Abstract

In this paper we introduce the topics covered in the RuSSIR 2014 course on Author Profiling and Plagiarism Detection (APPD). Author profiling distinguishes between classes of authors by studying how language is shared by groups of people. This task helps to identify profile aspects such as gender, age, native language, or even personality type. In plagiarism detection, by contrast, we are not interested in how language is shared. Instead, given a document, we investigate whether the writing style changes, in order to unveil text inconsistencies, i.e., unexpected irregularities across the document such as changes in vocabulary, style and text complexity. When it is not possible to retrieve the source document(s) from which the text was plagiarised, this intrinsic analysis of the suspicious document is the only way to find evidence of plagiarism. The source may be hard to retrieve because the documents are not available on the web, or because the plagiarised fragments were obfuscated via paraphrasing or translation (when the source document was written in another language). In this overview we also discuss the results of the shared tasks on author profiling (gender and age identification) and plagiarism detection that we help to organise at the PAN Lab on Uncovering Plagiarism, Authorship, and Social Software Misuse (http://pan.webis.de).
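
As an illustration of the intrinsic analysis described above, the following minimal Python sketch splits a document into word windows, computes two simple style features per window (type-token ratio as a vocabulary measure and mean word length as a rough complexity measure), and flags windows that deviate strongly from the rest of the document. It is only a sketch under stated assumptions: the feature set, the window size, the z-score threshold and the function names are illustrative choices, not the method taught in the course or used by any PAN participant.

```python
import re
import statistics


def window_features(words):
    """Return (type-token ratio, mean word length) for a non-empty word list."""
    ttr = len(set(words)) / len(words)
    mean_len = sum(len(w) for w in words) / len(words)
    return ttr, mean_len


def flag_style_outliers(text, window=50, step=25, threshold=2.0):
    """Return (start, end) word offsets of windows whose style deviates.

    A window is flagged when any feature lies more than `threshold`
    population standard deviations from the document-wide mean.
    All parameter defaults are illustrative assumptions.
    """
    # Letters-only tokens, lowercased; digits and punctuation are ignored.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if len(words) < window:
        return []  # too short to compare windows against each other

    starts = list(range(0, len(words) - window + 1, step))
    feats = [window_features(words[s:s + window]) for s in starts]

    flagged = set()
    for column in zip(*feats):  # one column of values per feature
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)
        if std == 0:
            continue  # feature is constant, so it carries no signal
        for i, value in enumerate(column):
            if abs(value - mean) / std > threshold:
                flagged.add(i)

    return [(starts[i], starts[i] + window) for i in sorted(flagged)]
```

A flagged window is only a hint that the style changes at that point; it is not evidence of plagiarism by itself, since legitimate quotations or topic shifts can produce the same signal.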

Notes

Acknowledgements

We would like to thank Yuri Chekhovich (Forecsys) and Mikhail Alexandrov (Russian Presidential Academy of National Economy and Public Administration) for providing plagiarised cases in Russian for the APPD course at RuSSIR. We thank Irina Chugur (UNED) and Francisco Rangel (Autoritas Consulting) for helping with the author profiling corpus in Russian. The PAN shared tasks on author profiling and on plagiarism detection have been organised in the framework of the WIQ-EI IRSES project (Grant No. 269180) within the EC FP 7 Marie Curie People. The research work described in the paper was carried out in the framework of the DIANA-APPLICATIONS-Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) project, and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems. Finally, we thank Hugo Jair Escalante (INAOE) for helping to improve this paper.

Copyright information

© Springer International Publishing Switzerland 2015

Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Authors and Affiliations

  1. Natural Language Engineering Lab, PRHLT Research Center, Universitat Politècnica de València, Valencia, Spain
