Near Duplicate Text Detection Using Frequency-Biased Signatures

  • Conference paper
Web Information Systems Engineering – WISE 2013 (WISE 2013)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 8180))

Abstract

As electronic documents become more widely used, people increasingly want to find documents that are complete or partial duplicates of one another. In this paper, we propose a near-duplicate text detection framework that uses signatures to save space and query time. We also propose a novel signature selection algorithm that uses the collection frequency of q-grams. We compare our algorithm with Winnowing, one of the state-of-the-art signature selection algorithms, and show that it achieves much better accuracy with lower time and space cost. We perform extensive experiments to verify our conclusion.
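
To make the terms in the abstract concrete, the following is a minimal sketch of signature selection over q-grams. The `winnow` function follows the commonly described form of Winnowing (keep the minimum hash in every window of w consecutive q-gram hashes, rightmost on ties). The `rarest_in_window` function is only a hypothetical illustration of biasing the choice by collection frequency; the abstract does not describe the paper's actual algorithm, so this should not be read as the authors' method. All function names, the use of CRC32 as the hash, and the parameters `q` and `w` are assumptions made for the example.

```python
import zlib
from collections import Counter


def qgrams(text, q):
    """Return all overlapping character q-grams of text, in order."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def collection_frequencies(docs, q):
    """Count how often each q-gram occurs across a collection of documents."""
    freq = Counter()
    for doc in docs:
        freq.update(qgrams(doc, q))
    return freq


def winnow(text, q, w):
    """Winnowing baseline: from every window of w consecutive q-gram hashes,
    keep the minimum hash (rightmost occurrence on ties).
    Returns a set of (position, hash) pairs; empty if the text is too short."""
    hashes = [zlib.crc32(g.encode("utf-8")) for g in qgrams(text, q)]
    selected = set()
    for start in range(max(len(hashes) - w + 1, 0)):
        window = hashes[start:start + w]
        smallest = min(window)
        offset = w - 1 - window[::-1].index(smallest)  # rightmost minimum
        selected.add((start + offset, smallest))
    return selected


def rarest_in_window(text, q, w, freq):
    """Hypothetical frequency-biased variant (an assumption, not the paper's
    algorithm): from every window keep the q-gram with the lowest collection
    frequency, breaking ties by the q-gram itself, then rightmost position."""
    grams = qgrams(text, q)
    selected = set()
    for start in range(max(len(grams) - w + 1, 0)):
        window = grams[start:start + w]
        best = min(range(len(window)),
                   key=lambda i: (freq.get(window[i], 0), window[i], -i))
        selected.add((start + best, window[best]))
    return selected


def jaccard(a, b):
    """Jaccard similarity of two signature collections, ignoring positions."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


# Two documents sharing one passage; any shared run of at least w + q - 1
# characters is guaranteed to contribute a common signature.
shared = "near duplicate text detection"
doc_a = "An introduction. " + shared + " is studied here."
doc_b = "Unrelated preface; " + shared + " appears again."
freq = collection_frequencies([doc_a, doc_b], q=5)

sig_a = {h for _, h in winnow(doc_a, q=5, w=4)}
sig_b = {h for _, h in winnow(doc_b, q=5, w=4)}
rare_a = {g for _, g in rarest_in_window(doc_a, 5, 4, freq)}
rare_b = {g for _, g in rarest_in_window(doc_b, 5, 4, freq)}
print(jaccard(sig_a, sig_b), jaccard(rare_a, rare_b))
```

Both selectors are local: the signature chosen for a window depends only on the q-grams inside that window (and, in the second case, on a shared frequency table), which is why a passage copied between documents yields overlapping signatures regardless of the surrounding text.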

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sun, Y., Qin, J., Wang, W. (2013). Near Duplicate Text Detection Using Frequency-Biased Signatures. In: Lin, X., Manolopoulos, Y., Srivastava, D., Huang, G. (eds) Web Information Systems Engineering – WISE 2013. WISE 2013. Lecture Notes in Computer Science, vol 8180. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41230-1_24

  • DOI: https://doi.org/10.1007/978-3-642-41230-1_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-41229-5

  • Online ISBN: 978-3-642-41230-1

  • eBook Packages: Computer Science, Computer Science (R0)
