Near Duplicate Document Detection for Large Information Flows

  • Daniele Montanari
  • Piera Laura Puglisi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7465)

Abstract

Near duplicate document detection aims to identify information items that convey the same (or very similar) content, possibly surrounded by diverse side information such as metadata, advertisements, timestamps, web presentation, and navigation aids. Identifying near duplicate information makes it possible to implement selection policies that optimize an information corpus and thereby improve its quality.

In this paper, we introduce a new method to find near duplicate documents based on q-grams extracted from the text. The algorithm exploits three major features: a similarity measure that compares document q-gram occurrences to evaluate the syntactic similarity of the compared texts; an indexing method that maintains an inverted index of q-grams; and an efficient allocation of the bitmaps, using a window size of 24 hours, to support the document comparison process.
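The three features above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the similarity measure, so a multiset Jaccard ratio over q-gram counts is assumed here, and plain Python sets stand in for the bitmaps. The names `qgrams`, `similarity`, and `WindowedIndex`, as well as the choice `q=3`, are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def qgrams(text, q=3):
    """Return the multiset of overlapping character q-grams of `text`."""
    normalized = " ".join(text.lower().split())  # fold case, collapse whitespace
    return Counter(normalized[i:i + q] for i in range(len(normalized) - q + 1))


def similarity(a, b):
    """Assumed measure: shared q-gram occurrences over total occurrences."""
    shared = sum((a & b).values())  # per q-gram minimum of the two counts
    total = sum((a | b).values())   # per q-gram maximum of the two counts
    return shared / total if total else 0.0


class WindowedIndex:
    """Inverted index from q-gram to document ids, limited to a time window."""

    def __init__(self, q=3, window=timedelta(hours=24), threshold=0.8):
        self.q = q
        self.window = window
        self.threshold = threshold
        self.postings = defaultdict(set)  # q-gram -> ids of documents containing it
        self.docs = {}                    # id -> (timestamp, q-gram counts)

    def _expire(self, now):
        """Drop documents that are older than the window."""
        old = [d for d, (ts, _) in self.docs.items() if now - ts > self.window]
        for doc_id in old:
            _, grams = self.docs.pop(doc_id)
            for g in grams:
                self.postings[g].discard(doc_id)

    def add(self, doc_id, text, timestamp):
        """Index a document and return the ids of earlier near duplicates."""
        self._expire(timestamp)
        grams = qgrams(text, self.q)
        # Only documents sharing at least one q-gram are compared.
        candidates = set()
        for g in grams:
            candidates |= self.postings[g]
        matches = sorted(
            c for c in candidates
            if similarity(grams, self.docs[c][1]) >= self.threshold
        )
        self.docs[doc_id] = (timestamp, grams)
        for g in grams:
            self.postings[g].add(doc_id)
        return matches


index = WindowedIndex()
start = datetime(2012, 1, 1, 8, 0)
index.add("feed-a", "Oil prices rose sharply on Monday after new production cuts.", start)
duplicates = index.add(
    "feed-b",
    "Oil prices rose sharply on Monday after new production cuts!",
    start + timedelta(hours=1),
)
```

In this sketch the inverted index restricts comparisons to documents that share at least one q-gram with the incoming item, and the window keeps the index from growing without bound. A bitmap per q-gram, as mentioned in the abstract, would replace the sets used here.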

The proposed algorithm has been tested in a multifeed news content management system to filter out duplicated news items coming from different information channels. The experimental evaluation shows the efficiency and the accuracy of our solution compared with other existing techniques. The results on a real dataset report an F-measure of 9.53 with a similarity threshold of 0.8.
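For readers unfamiliar with the metric, the following Python sketch shows how an F-measure can be computed from detected and true duplicate pairs. The abstract does not state which variant was used, so the balanced form (beta = 1) is an assumption, and the function name `f_measure` and the sample pairs are hypothetical.

```python
def f_measure(predicted, actual, beta=1.0):
    """Return (precision, recall, F-beta) for two collections of duplicate pairs.

    Pairs are unordered, so ("a", "b") and ("b", "a") count as the same pair.
    """
    predicted = {frozenset(p) for p in predicted}
    actual = {frozenset(p) for p in actual}
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, score


# Hypothetical example: three detected pairs, four true pairs, two in common.
detected = [("a", "b"), ("c", "d"), ("e", "f")]
truth = [("b", "a"), ("c", "d"), ("g", "h"), ("i", "j")]
precision, recall, f1 = f_measure(detected, truth)
```

Here precision is the fraction of detected pairs that are true duplicates, and recall is the fraction of true duplicate pairs that were detected.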

Keywords

duplicate, information flows, q-grams

Copyright information

© IFIP International Federation for Information Processing 2012

Authors and Affiliations

  • Daniele Montanari (1)
  • Piera Laura Puglisi (2)
  1. ICT eni - Semantic Technologies, Bologna, Italy
  2. GESP, Bologna, Italy
