Abstract
This paper presents research results from building a Polish semantic Internet search engine, Natively Enhanced Knowledge Sharing Technologies (NEKST), and its plagiarism detection module. The main goal is to describe the tools and algorithms of the engine and its use within the Open System for Antiplagiarism (OSA).
Acknowledgement
The author acknowledges his contribution to the NEKST project (http://nekst.ipipan.waw.pl), funded by the Innovative Economy Operational Programme (POIG.01.01.02-14-013/09), and to the OSA project, funded by the Interuniversity Centre for IT (MUCI, Międzyuniwersyteckie Centrum Informatyzacji, http://muci.edu.pl).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Szmit, R. (2017). Fast Plagiarism Detection in Large-Scale Data. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation. BDAS 2017. Communications in Computer and Information Science, vol 716. Springer, Cham. https://doi.org/10.1007/978-3-319-58274-0_27
DOI: https://doi.org/10.1007/978-3-319-58274-0_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-58273-3
Online ISBN: 978-3-319-58274-0
eBook Packages: Computer Science (R0)