Developing a Framework for a Thai Plagiarism Corpus

  • Santipong Thaiprayoon
  • Pornpimon Palingoon
  • Kanokorn Trakultaweekoon
  • Supon Klaithin
  • Choochart Haruechaiyasak
  • Alisa Kongthon
  • Sumonmas Thatpitakkul
  • Sawit Kasuriya
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1215)


One obstacle to building a Thai plagiarism corpus is the lack of real examples of plagiarized text. To address this, we present the design and construction of a new Thai plagiarism corpus, TPLAC-2019, for evaluating plagiarism detection algorithms for Thai. The corpus is created using two methods: (1) simulated plagiarism and (2) artificial plagiarism. For simulated plagiarism, we provide a Thai plagiarism tagging tool, PlaTool, and a Thai plagiarism guideline to assist human annotators in plagiarizing text passages. For artificial plagiarism, plagiarized documents are generated automatically by a machine. Within this second method, we also propose a new technique for automatically creating plagiarized text passages that resemble human language. To evaluate the machine-generated Thai plagiarized passages, we prepared test sets generated by a baseline method and by the proposed method, and set up experiments comparing the readability of the plagiarized documents produced by the two methods. The experimental results show that the proposed method improves the readability of the generated text by up to 40%.
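As a rough illustration of what machine-generated (artificial) plagiarism can look like, the sketch below applies random token deletion and local shuffling to a passage, a style of random text operation commonly used as a baseline in plagiarism corpora. It is not the method proposed in this paper, whose details the abstract does not give. The function name `obfuscate_tokens` and its parameters are hypothetical, and the input is assumed to be already segmented into words, since Thai text has no spaces between words.

```python
import random


def obfuscate_tokens(tokens, delete_ratio=0.1, shuffle_window=3, seed=None):
    """Return a randomly obfuscated copy of a pre-segmented token list.

    Hypothetical baseline sketch: random deletion followed by shuffling
    inside fixed-size windows. Not the paper's proposed method.
    """
    rng = random.Random(seed)

    # Randomly drop tokens; fall back to the full list if everything was dropped.
    kept = [t for t in tokens if rng.random() >= delete_ratio] or list(tokens)

    # Shuffle inside consecutive windows so word order changes only locally.
    obfuscated = []
    for start in range(0, len(kept), shuffle_window):
        window = kept[start:start + shuffle_window]
        rng.shuffle(window)
        obfuscated.extend(window)
    return obfuscated


# Assumed example input: a short Thai passage already split into words.
source = ["การ", "ตรวจ", "จับ", "การ", "ลอก", "เลียน", "ผล", "งาน"]
print(" ".join(obfuscate_tokens(source, seed=0)))
```

Output produced this way is often hard to read, which is the kind of weakness a readability comparison between a baseline and an improved generator would expose.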


Keywords: Thai plagiarism corpus, Simulated plagiarism, Artificial plagiarism, Text obfuscation



Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  1. National Electronics and Computer Technology Center, Khlong Luang, Thailand
