Efficient Semantic-Aware Detection of Near Duplicate Resources

  • Ekaterini Ioannou
  • Odysseas Papapetrou
  • Dimitrios Skoutas
  • Wolfgang Nejdl
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6089)


Efficiently detecting near duplicate resources is an important task when integrating information from various sources and applications. Once detected, near duplicate resources can be grouped together, merged, or removed to avoid repetition and redundancy and to increase the diversity of the information provided to the user. In this paper, we introduce an approach for efficient semantic-aware near duplicate detection that combines an indexing scheme for similarity search with the RDF representations of the resources. We provide a probabilistic analysis of the correctness of the suggested approach, which allows applications to configure it to satisfy their specific quality requirements. Our experimental evaluation on the RDF descriptions of real-world news articles from various news agencies demonstrates the efficiency and effectiveness of our approach.


Keywords: near duplicate detection, data integration


Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Ekaterini Ioannou (1)
  • Odysseas Papapetrou (1)
  • Dimitrios Skoutas (1)
  • Wolfgang Nejdl (1)

  1. L3S Research Center / Leibniz Universität Hannover
