Abstract
As electronic documents become more popular, people increasingly want to find documents that are complete or partial duplicates of one another. In this paper, we propose a near-duplicate text detection framework that uses signatures to save space and query time. We also propose a novel signature selection algorithm that uses the collection frequency of q-grams. We compare our algorithm with Winnowing, one of the state-of-the-art signature selection algorithms, and show that it achieves much better accuracy at lower time and space cost. Extensive experiments verify this conclusion.
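To make the idea of signature selection concrete, the following is a minimal sketch, not the authors' implementation. The `winnow` function follows the published Winnowing scheme (keep the minimum q-gram hash in each sliding window, breaking ties towards the rightmost position). The `frequency_biased` function is only an illustrative guess at what selection by collection frequency could look like: the abstract does not specify the rule, so picking the rarest q-gram in each window is an assumption. The function names, the parameters `q` and `w`, and the use of CRC32 as the hash are likewise hypothetical choices.

```python
import zlib
from collections import Counter


def qgrams(text, q):
    """Return every length-q substring of text, in order."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def winnow(text, q=5, w=4):
    """Winnowing: keep the minimum q-gram hash in each window of w hashes.

    Returns a set of (hash, position) pairs. Texts with fewer than w
    q-grams yield an empty set in this sketch.
    """
    hashes = [zlib.crc32(g.encode("utf-8")) for g in qgrams(text, q)]
    selected = set()
    for start in range(max(len(hashes) - w + 1, 0)):
        window = hashes[start:start + w]
        smallest = min(window)
        # Break ties by taking the rightmost minimum, as Winnowing does.
        offset = len(window) - 1 - window[::-1].index(smallest)
        selected.add((smallest, start + offset))
    return selected


def frequency_biased(text, collection_freq, q=5, w=4):
    """ASSUMED variant: keep the rarest q-gram in each window of w q-grams.

    collection_freq maps a q-gram to its count over the whole collection;
    unseen q-grams are treated as having frequency 0.
    """
    grams = qgrams(text, q)
    selected = set()
    for start in range(max(len(grams) - w + 1, 0)):
        window = grams[start:start + w]
        # Rank by collection frequency, then by the q-gram text so that
        # ties are resolved deterministically.
        best = min(
            range(len(window)),
            key=lambda i: (collection_freq.get(window[i], 0), window[i]),
        )
        selected.add((window[best], start + best))
    return selected


if __name__ == "__main__":
    docs = [
        "the quick brown fox jumps over the lazy dog",
        "a quick brown fox jumped over the lazy dogs",
    ]
    # Collection frequency: how often each q-gram occurs across all documents.
    freq = Counter(g for d in docs for g in qgrams(d, 5))
    sig_a = {g for g, _ in frequency_biased(docs[0], freq)}
    sig_b = {g for g, _ in frequency_biased(docs[1], freq)}
    print("shared signatures:", len(sig_a & sig_b))
    print("winnowing fingerprints:", len(winnow(docs[0])))
```

Both functions select at least one signature from every window, so any shared passage of at least `q + w - 1` characters contributes a candidate signature regardless of where it appears in a document. The two differ only in the ranking used inside a window.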
References
Alonso, O., Fetterly, D., Manasse, M.: Duplicate news story detection revisited. Tech. Rep. 60, Microsoft Research (2013)
Bjørner, N., Blass, A., Gurevich, Y.: Content-dependent chunking for differential compression, the local maximum approach. J. Comput. Syst. Sci. 76(3-4), 154–203 (2010)
Brin, S., Davis, J., Garcia-Molina, H.: Copy detection mechanisms for digital documents. In: SIGMOD Conference, pp. 398–409 (1995)
Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations (extended abstract). In: STOC, pp. 327–336 (1998)
Butakov, S., Scherbinin, V.: On the number of search queries required for internet plagiarism detection. In: ICALT, pp. 482–483 (2009)
Charikar, M.: Similarity estimation techniques from rounding algorithms. In: STOC, pp. 380–388 (2002)
Chowdhury, A., Frieder, O., Grossman, D.A., McCabe, M.C.: Collection statistics for fast duplicate document detection. ACM Trans. Inf. Syst. 20(2), 171–191 (2002)
Conrad, J.G., Guo, X.S., Schriber, C.P.: Online duplicate document detection: signature reliability in a dynamic retrieval environment. In: CIKM, pp. 443–452 (2003)
Hamid, O.A., Behzadi, B., Christoph, S., Henzinger, M.R.: Detecting the origin of text segments efficiently. In: WWW, pp. 61–70 (2009)
Hua, N., Song, H., Lakshman, T.V.: Variable-stride multi-pattern matching for scalable deep packet inspection. In: INFOCOM, pp. 415–423 (2009)
Jiang, J., Tang, Y., Liu, B., Xu, Y., Wang, X.: Skip finite automaton: A content scanning engine to secure enterprise networks. In: GLOBECOM, pp. 1–5 (2010)
Kolcz, A., Chowdhury, A., Alspector, J.: Improved robustness of signature-based near-replica detection via lexicon randomization. In: KDD, pp. 605–610 (2004)
Manber, U.: Finding similar files in a large file system. In: USENIX Winter, pp. 1–10 (1994)
Manku, G.S., Jain, A., Sarma, A.D.: Detecting near-duplicates for web crawling. In: WWW, pp. 141–150 (2007)
Mittelbach, A., Lehmann, L., Rensing, C., Steinmetz, R.: Automatic detection of local reuse. In: Wolpers, M., Kirschner, P.A., Scheffel, M., Lindstaedt, S., Dimitrova, V. (eds.) EC-TEL 2010. LNCS, vol. 6383, pp. 229–244. Springer, Heidelberg (2010)
Potthast, M., Barrón-Cedeño, A., Eiselt, A., Stein, B., Rosso, P.: Overview of the 2nd international competition on plagiarism detection. In: CLEF (Notebook Papers/LABs/Workshops) (2010)
Schleimer, S., Wilkerson, D.S., Aiken, A.: Winnowing: Local algorithms for document fingerprinting. In: SIGMOD Conference, pp. 76–85 (2003)
Seo, J., Croft, W.B.: Local text reuse detection. In: SIGIR, pp. 571–578 (2008)
Theobald, M., Siddharth, J., Paepcke, A.: Spotsigs: robust and efficient near duplicate detection in large web collections. In: SIGIR, pp. 563–570 (2008)
Xiao, C., Wang, W., Lin, X., Yu, J.X., Wang, G.: Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst. 36(3), 15 (2011)
Yang, H., Callan, J.P.: Near-duplicate detection by instance-level constrained clustering. In: SIGIR, pp. 421–428 (2006)
Zhang, J., Suel, T.: Efficient search in large textual collections with redundancy. In: WWW, pp. 411–420 (2007)
Zhang, Q., Wu, Y., Ding, Z., Huang, X.: Learning hash codes for efficient content reuse detection. In: SIGIR, pp. 405–414 (2012)
Zhang, X., Qin, J., Wang, W., Sun, Y., Lu, J.: Hmsearch: An efficient hamming distance query processing algorithm. In: SSDBM (2013)
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sun, Y., Qin, J., Wang, W. (2013). Near Duplicate Text Detection Using Frequency-Biased Signatures. In: Lin, X., Manolopoulos, Y., Srivastava, D., Huang, G. (eds) Web Information Systems Engineering – WISE 2013. WISE 2013. Lecture Notes in Computer Science, vol 8180. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41230-1_24
DOI: https://doi.org/10.1007/978-3-642-41230-1_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41229-5
Online ISBN: 978-3-642-41230-1
eBook Packages: Computer Science, Computer Science (R0)