An efficient classification approach in imbalanced datasets for intrinsic plagiarism detection

  • Andrianna Polydouri
  • Eleni Vathi
  • Georgios Siolas
  • Andreas Stafylopatis
Original Paper

Abstract

The ever-increasing volume of information brought about by the widespread use of computers and the web has made effective plagiarism detection methods a necessity. Plagiarism occurs in many settings and forms: in literature, in academic papers, even in programming code. Intrinsic plagiarism detection is the task of discovering plagiarized passages in a text document by identifying stylistic changes and inconsistencies within the document itself, given that no reference corpus is available. The main idea is to profile the style of the original author and mark the passages that differ significantly from it. In this work, we follow a supervised machine learning classification approach. We treat, for the first time, the imbalance of the data as a crucial parameter of the problem and experiment with various balancing techniques. In addition, we propose some novel stylistic features. We combine our features and imbalanced-dataset treatment with various classification methods. Our detection system is tested on the data corpora of the PAN Webis intrinsic plagiarism detection shared tasks. It is compared to the best-performing detection systems on these datasets and achieves the best resulting scores.
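The balancing idea can be illustrated with a minimal sketch of SMOTE-style oversampling, the technique named in the keywords. This is not the authors' implementation: the function name `smote_oversample`, the toy two-dimensional feature vectors, and the parameter values are all assumptions made for illustration. Synthetic minority ("plagiarized") samples are created by interpolating between an existing minority sample and one of its nearest minority-class neighbours.

```python
import random
from math import dist


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours.
    (Hypothetical helper, written for illustration only.)"""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy stylometric feature vectors (made-up values): many "original"
# passages, few "plagiarized" ones.
original = [(0.1 * i, 0.5) for i in range(10)]
plagiarized = [(2.0, 2.0), (2.2, 2.1), (2.1, 2.4)]

# Generate enough synthetic minority samples to match the majority class.
extra = smote_oversample(
    plagiarized, n_new=len(original) - len(plagiarized), k=2
)
balanced_minority = plagiarized + extra
```

After this step both classes have the same number of training samples, so a classifier trained on them is less likely to default to the majority class. In practice a maintained library such as imbalanced-learn (reference 16) would normally be used instead of a hand-written routine.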

Keywords

  • Intrinsic plagiarism detection
  • Stylometry
  • Supervised learning
  • Unbalanced training data
  • SMOTE
  • PAN Webis

Notes

Acknowledgements

Many thanks to Panagiotis Christou, whose comments were essential in overcoming some problems during the design of this system.

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Intelligent Systems, Content and Interaction Laboratory, School of Electrical and Computer Engineering, National and Technical University of Athens, Athens, Greece
