Abstract
This paper addresses the detection of semantic plagiarism in Czech texts. It combines a similarity measure previously used for text compression with a structured thesaurus of synonyms and a stemming algorithm, in order to detect rewording and restructuring of Czech texts. From a 100 GB corpus, we extracted 884 files of assignments, semester works, and theses written by B.A., M.A., and Ph.D. students majoring in Computer Science. The extracted test data used in our initial experiment amounted to 1.98 GB of plain text. The method is tested first on short texts and then applied to the longer texts of students’ assignments. On short texts, the method detected paraphrased, semantically similar texts more accurately, but it was less accurate on identical texts with rearranged paragraphs. The experiment on the long-text corpus of students’ assignments and theses shows a semantic plagiarism rate of 23.9 %. However, manual inspection of the documents revealed some noise in the results, caused by different documents sharing the same technical terms, scientific definitions, and bibliography entries. In future work, these results will be refined by building a file-specific stop word list, adding an exact-match method, and removing references and other standard text templates that commonly appear in certain parts of students’ assignments and theses.
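The pipeline outlined in the abstract (synonym substitution, stemming, then a compression-derived similarity measure) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the formula, so the LZ78-style phrase dictionary and the ratio of shared phrases to the smaller dictionary are assumed here, and `SYNONYMS`, `toy_stem`, `normalize`, `lz78_phrases`, and `similarity` are hypothetical names with toy data.

```python
import re

# Hypothetical toy thesaurus: maps each synonym to one canonical word.
# A real system would load a full Czech synonym dictionary instead.
SYNONYMS = {"automobil": "auto", "vůz": "auto"}


def toy_stem(word):
    """Very crude suffix stripping; a stand-in for a real Czech stemmer."""
    for suffix in ("ami", "ech", "ou", "em", "y", "a", "u", "e"):
        if len(word) > len(suffix) + 2 and word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, tokenize, replace synonyms, then stem each token."""
    tokens = re.findall(r"\w+", text.lower())
    return [toy_stem(SYNONYMS.get(t, t)) for t in tokens]


def lz78_phrases(tokens):
    """Parse tokens LZ78-style and return the set of dictionary phrases.

    A phrase is extended token by token until it is no longer in the
    dictionary; it is then added and parsing restarts with an empty phrase.
    """
    phrases = set()
    current = ()
    for tok in tokens:
        current = current + (tok,)
        if current not in phrases:
            phrases.add(current)
            current = ()
    return phrases


def similarity(text_a, text_b):
    """Assumed measure: shared phrases divided by the smaller dictionary."""
    pa = lz78_phrases(normalize(text_a))
    pb = lz78_phrases(normalize(text_b))
    if not pa or not pb:
        return 0.0
    return len(pa & pb) / min(len(pa), len(pb))


if __name__ == "__main__":
    print(similarity("Petr koupil nový automobil", "Petr koupil nový vůz"))
```

Because synonyms are collapsed before the phrase dictionaries are built, two sentences that differ only by a synonym end up with identical dictionaries. Phrases in this sketch depend on token order, which is one plausible reason a measure of this kind could score lower on texts whose paragraphs have been rearranged.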
Acknowledgments
This work was supported by the IT4Innovations Centre of Excellence project (CZ.1.05/1.1.00/02.0070), funded by the European Regional Development Fund and the national budget of the Czech Republic via the Research and Development for Innovations Operational Programme; by Project SP2015/105 “DPDM - Database of Performance and Dependability Models” of the Student Grant System, VŠB - Technical University of Ostrava; and by Project SP2015/146 “Parallel processing of Big data 2” of the Student Grant System, VŠB - Technical University of Ostrava.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Soori, H., Prilepok, M., Platos, J., Snasel, V. (2016). Semantic and Similarity Measure Methods for Plagiarism Detection of Students’ Assignments. In: Abraham, A., Wegrzyn-Wolska, K., Hassanien, A., Snasel, V., Alimi, A. (eds) Proceedings of the Second International Afro-European Conference for Industrial Advancement AECIA 2015. Advances in Intelligent Systems and Computing, vol 427. Springer, Cham. https://doi.org/10.1007/978-3-319-29504-6_12
DOI: https://doi.org/10.1007/978-3-319-29504-6_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-29503-9
Online ISBN: 978-3-319-29504-6
eBook Packages: Engineering, Engineering (R0)