Semantic and Similarity Measure Methods for Plagiarism Detection of Students’ Assignments

  • Conference paper
Proceedings of the Second International Afro-European Conference for Industrial Advancement AECIA 2015

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 427)

Abstract

This paper addresses the detection of semantic plagiarism in Czech texts. It combines a similarity measure previously used for text compression with a structured synonym thesaurus and a stemming algorithm in order to detect rewording and restructuring of Czech text. From a 100 GB corpus, we extracted 884 files of B.A., M.A., and Ph.D. students’ assignments, semester works, and theses in Computer Science. The extracted test data used for our initial experiment amounted to 1.98 GB of plain text. The method is first tested on short texts and then applied to the longer texts of students’ assignments. On short texts, the method was more accurate at detecting paraphrased, semantically similar texts than at detecting identical texts with rearranged paragraphs. The experiment on the long-text corpus of students’ assignments and theses shows a semantic plagiarism rate of 23.9 %. However, manual inspection of the documents revealed some noise in the results, caused by different documents sharing the same technical terms, scientific definitions, and bibliography entries. In future work, these results will be fine-tuned by building a file-specific stop-word list, adding an exact-match method, and removing references and other standard text templates that commonly appear in certain parts of students’ assignments and theses.
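The pipeline outlined in the abstract (synonym normalisation, stemming, then a compression-derived similarity measure) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the tiny `SYNONYMS` table stands in for a real Czech thesaurus, `toy_stem` is a crude suffix stripper rather than a real Czech stemmer, and the measure (shared LZ78-style phrases divided by the size of the smaller phrase dictionary) is one plausible reading of "similarity based on data compression".

```python
import re

# Hypothetical toy thesaurus: each synonym maps to one canonical word.
SYNONYMS = {"auto": "vůz", "automobil": "vůz"}

# Hypothetical suffix list; a real Czech stemmer is far more involved.
SUFFIXES = ("ami", "ech", "y", "u", "a", "e", "i")


def toy_stem(word):
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, tokenise, unify synonyms, then stem each token."""
    tokens = re.findall(r"\w+", text.lower())
    return [toy_stem(SYNONYMS.get(tok, tok)) for tok in tokens]


def lz78_phrases(tokens):
    """Build an LZ78-style dictionary of token phrases.

    A phrase grows token by token until it is new to the dictionary;
    it is then stored and the next phrase starts from scratch.
    """
    phrases = set()
    current = ()
    for tok in tokens:
        current += (tok,)
        if current not in phrases:
            phrases.add(current)
            current = ()
    return phrases


def similarity(text_a, text_b):
    """Shared phrases divided by the size of the smaller dictionary."""
    dict_a = lz78_phrases(normalize(text_a))
    dict_b = lz78_phrases(normalize(text_b))
    if not dict_a or not dict_b:
        return 0.0
    return len(dict_a & dict_b) / min(len(dict_a), len(dict_b))


if __name__ == "__main__":
    print(similarity("auto stojí na ulici", "automobil stojí na ulici"))
    print(similarity("pes běží", "kočka spí"))
```

In this sketch the two sentences that differ only by a synonym normalise to the same token sequence, so they share every phrase, while the unrelated pair shares none. Because LZ78 phrases depend on the order in which tokens arrive, reordering a longer text changes which phrases are built, which is consistent with the lower accuracy the abstract reports for rearranged paragraphs.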


Acknowledgments

This work was supported by the IT4Innovations Centre of Excellence project (CZ.1.05/1.1.00/02.0070), funded by the European Regional Development Fund and the national budget of the Czech Republic via the Research and Development for Innovations Operational Programme; by Project SP2015/105 “DPDM - Database of Performance and Dependability Models” of the Student Grant System, VŠB - Technical University of Ostrava; and by Project SP2015/146 “Parallel processing of Big data 2” of the Student Grant System, VŠB - Technical University of Ostrava.

Author information

Corresponding author

Correspondence to Hussein Soori.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Soori, H., Prilepok, M., Platos, J., Snasel, V. (2016). Semantic and Similarity Measure Methods for Plagiarism Detection of Students’ Assignments. In: Abraham, A., Wegrzyn-Wolska, K., Hassanien, A., Snasel, V., Alimi, A. (eds) Proceedings of the Second International Afro-European Conference for Industrial Advancement AECIA 2015. Advances in Intelligent Systems and Computing, vol 427. Springer, Cham. https://doi.org/10.1007/978-3-319-29504-6_12

  • DOI: https://doi.org/10.1007/978-3-319-29504-6_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-29503-9

  • Online ISBN: 978-3-319-29504-6

  • eBook Packages: Engineering, Engineering (R0)
