Skip to main content

An Overview of XML Duplicate Detection Algorithms

  • Chapter
Soft Computing in XML Data Management

Part of the book series: Studies in Fuzziness and Soft Computing ((STUDFUZZ,volume 255))

Abstract

Fuzzy duplicate detection aims at identifying multiple representations of real-world objects in a data source, and is a task of critical relevance in data cleaning, data mining, and data integration tasks. It has a long history for relational data, stored in a single table or in multiple tables with an equal schema. However, algorithms for fuzzy duplicate detection in more complex structures, such as hierarchies of a data warehouse, XML data, or graph data have only recently emerged. These algorithms use similarity measures that consider the duplicate status of their direct neighbors to improve duplicate detection effectiveness. In this chapter, we study different approaches that have been proposed for XML fuzzy duplicate detection. Our study includes a description and analysis of the different approaches, as well as a comparative experimental evaluation performed on both artificial and real-world data. The two main dimensions used for comparison are the methods effectiveness and efficiency. Our comparison shows that the DogmatiX system [44] is the most effective overall, as it yields the highest recall and precision values for various kinds of differences between duplicates. Another system, called XMLDup [27] has a similar performance, being most effective especially at low recall values. Finally, the SXNM system [36] is the most efficient, as it avoids executing too many pairwise comparisons, but its effectiveness is greatly affected by errors in the data.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 129.00
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 169.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 169.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In: Conference on Very Large Databases (VLDB), Hong Kong, China, pp. 586–597 (2002)

    Google Scholar 

  2. Baeza-Yates, R.A., Ribeiro-Neto, B.: Modern information retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston (1999)

    Google Scholar 

  3. Batini, C., Scannapieco, M.: Data quality - concepts, methodologies and techniques. Springer, Berlin (2006)

    MATH  Google Scholar 

  4. Bhattacharya, I., Getoor, L.: Relational clustering for multi-type entity resolution. In: KDD Workshop on Multi-Relational Data Mining, MRDM (2005)

    Google Scholar 

  5. Bhattacharya, I., Getoor, L.: A latent dirichlet model for unsupervised entity resolution. In: Conference on Data Mining (SDM), Bethesda, MD (2006)

    Google Scholar 

  6. Bhattacharya, I., Getoor, L.: Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data 1(1) (2007)

    Google Scholar 

  7. Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: Conference on Knowledge Discovery and Data Mining (KDD), Washington, DC, pp. 39–48 (2003)

    Google Scholar 

  8. Bille, P.: A survey on tree edit distance and related problems. Theoretical Computer Science 337(1–3), 217–239 (2005)

    Article  MATH  MathSciNet  Google Scholar 

  9. Bleiholder, J., Naumann, F.: Data fusion. ACM Comput. Surv. 41(1), 1–41 (2008), http://doi.acm.org/10.1145/1456650.1456651

    Article  Google Scholar 

  10. Carvalho, J.C.P., da Silva, A.S.: Finding similar identities among objects from multiple web sources. In: CIKM Workshop on Web Information and Data Management (WIDM), New Orleans, Louisiana, USA, pp. 90–93 (2003)

    Google Scholar 

  11. Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: Conference on the Management of Data (SIGMOD), San Diego, CA, pp. 313–324 (2003)

    Google Scholar 

  12. Chaudhuri, S., Ganti, V., Motwani, R.: Robust identification of fuzzy duplicates. In: International Conference on Data Engineering (ICDE), Tokyo, Japan, pp. 865–876 (2005)

    Google Scholar 

  13. Chen, Z., Kalashnikov, D.V., Mehrotra, S.: Exploiting relationships for object consolidation. In: SIGMOD Workshop on Information Quality in Information Systems (IQIS), Baltimore, MD (2005)

    Google Scholar 

  14. Cohen, W.W., Richman, J.: Learning to match and cluster large high-dimensional data sets for data integration. In: Conference on Knowledge Discovery and Data Mining (KDD), Edmonton, Alberta, Canada, pp. 475–480 (2002)

    Google Scholar 

  15. Doan, A., Lu, Y., Lee, Y., Han, J.: Profile-based object matching for information integration. IEEE Intelligent Systems 18(5), 54–59 (2003)

    Article  Google Scholar 

  16. Dong, X., Halevy, A., Madhavan, J.: Reference reconciliation in complex information spaces. In: Conference on the Management of Data (SIGMOD), Baltimore, MD, pp. 85–96 (2005)

    Google Scholar 

  17. Elfeky, M.G., Elmagarmid, A.K., Verykios, V.S.: Tailor: A record linkage tool box. In: International Conference on Data Engineering (ICDE), San Jose, CA, pp. 17–28 (2002)

    Google Scholar 

  18. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng. 19(1) (2007)

    Google Scholar 

  19. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. Journal of the American Statistical Association (1969)

    Google Scholar 

  20. Guha, S., Jagadish, H.V., Koudas, N., Srivastava, D., Yu, T.: Approximate XML joins. In: Conference on the Management of Data (SIGMOD), Madison, WI (2002)

    Google Scholar 

  21. Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: Conference on the Management of Data (SIGMOD), San Jose, CA, pp. 127–138 (1995)

    Google Scholar 

  22. Herschel, M., Naumann, F.: Scaling up duplicate detection in graph data. In: Conference on Information and Knowledge Management (CIKM), pp. 1325–1326 (2008)

    Google Scholar 

  23. Jin, L., Li, C., Mehrotra, S.: Efficient record linkage in large data sets. In: Conference on Database Systems for Advanced Applications (DASFAA), Kyoto, Japan (2003)

    Google Scholar 

  24. Kalashnikov, D.V., Mehrotra, S.: Domain-independent data cleaning via analysis of entity-relationship graph. ACM Transactions on Database Systems (TODS) 31(2), 716–767 (2006)

    Article  Google Scholar 

  25. Kuhn, H.W.: The hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1–2), 83–97 (1955)

    Article  MathSciNet  Google Scholar 

  26. Lehti, P., Fankhauser, P.: Unsupervised duplicate detection using sample non-duplicates. Journal on Data Semantics VII 4244, 136–164 (2006)

    Article  Google Scholar 

  27. Leitão, L., Calado, P., Weis, M.: Structure-based inference of XML similarity for fuzzy duplicate detection. In: Conference on Information and Knowledge Management (CIKM), pp. 293–302 (2007)

    Google Scholar 

  28. Levenshtein, V.: Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady 10, 707 (1966)

    MathSciNet  Google Scholar 

  29. Lim, E.P., Srivastava, J., Prabhakar, S., Richardson, J.: Entity identification in database integration. In: International Conference on Data Engineering (ICDE), Vienna, Austria, pp. 294–301 (1993)

    Google Scholar 

  30. McCallum, A., Wellner, B.: Object consolidation by graph partitioning with a conditionally-trained distance metric. In: KDD Workshop on Data Cleaning, Record Linkage and Object Consolidation, Washington, DC (2003)

    Google Scholar 

  31. Milano, D., Scannapieco, M., Catarci, T.: Structure aware XML object identification. In: VLDB Workshop on Clean Databases (CleanDB), Seoul, Korea (2006)

    Google Scholar 

  32. Minkov, E., Cohen, W.W., Ng, A.Y.: Contextual search and name disambiguation in email using graphs. In: Conference on Research and Development in Information Retrieval (SIGIR), Seattle, Washington, USA, pp. 27–34 (2006)

    Google Scholar 

  33. Monge, A.E., Elkan, C.P.: An efficient domain-independent algorithm for detecting approximately duplicate database records. In: SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), Tucson, AZ (1997)

    Google Scholar 

  34. Newcombe, H., Kennedy, J., Axford, S., James, A.: Automatic linkage of vital records. Science 130(3381), 954–959 (1959)

    Article  Google Scholar 

  35. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of plausible inference, 2nd edn. Morgan Kaufmann Publishers, San Francisco (1988)

    Google Scholar 

  36. Puhlmann, S., Weis, M., Naumann, F.: XML duplicate detection using sorted neigborhoods. In: Conference on Extending Database Technology (EDBT), Munich, Germany, pp. 773–791 (2006)

    Google Scholar 

  37. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB Journal 10(4), 334–350 (2001)

    Article  MATH  Google Scholar 

  38. Rahm, E., Do, H.H.: Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin 23, 3–13 (2000)

    Google Scholar 

  39. Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: Conference on Knowledge Discovery and Data Mining (KDD), Edmonton, Alberta, pp. 269–278 (2002)

    Google Scholar 

  40. Singla, P., Domingos, P.: Multi-relational record linkage. In: KDD Workshop on Multi-Relational Data Mining (MRDM), Seattle, WA (2004)

    Google Scholar 

  41. Singla, P., Domingos, P.: Object identification with attribute-mediated dependences. In: Conference on Principals and Practice of Knowledge Discovery in Databases (PKDD), Porto, Portugal, pp. 297–308 (2005)

    Google Scholar 

  42. Weis, M.: Duplicate Detection in XML Data. WiKu-Verlag Verlag fuer Wissenschaft und Kultur (2008)

    Google Scholar 

  43. Weis, M., Naumann, F.: Duplicate detection in XML. In: SIGMOD Workshop on Information Quality in Information Systems (IQIS), Paris, France, pp. 10–19 (2004)

    Google Scholar 

  44. Weis, M., Naumann, F.: Dogmatix tracks down duplicates in XML. In: Conference on the Management of Data (SIGMOD), Baltimore, MD, pp. 431–442 (2005)

    Google Scholar 

  45. Weis, M., Naumann, F.: Detecting duplicates in complex XML data. In: International Conference on Data Engineering (ICDE), Atlanta, GA (2006)

    Google Scholar 

  46. Weis, M., Naumann, F., Jehle, U., Lufter, J., Schuster, H.: Industry-scale duplicate detection. In: Proceedings of the VLDB Endowment (PVLDB), vol. 1(2), pp. 1253–1264 (2008)

    Google Scholar 

  47. Whang, S.E., Menestrina, D., Koutrika, G., Tin Theobald, M., Garcia-Molina, H.: Entity resolution with iterative blocking. In: Conference on the Management of Data (SIGMOD). Providence, Rhode Island (2009)

    Google Scholar 

  48. Yin, X., Han, J., Yu, P.S.: LinkClus: Efficient clustering via heterogeneous semantic links. In: Conference on Very Large Databases (VLDB), Seoul, Korea, pp. 427–438 (2006)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Calado, P., Herschel, M., Leitão, L. (2010). An Overview of XML Duplicate Detection Algorithms. In: Ma, Z., Yan, L. (eds) Soft Computing in XML Data Management. Studies in Fuzziness and Soft Computing, vol 255. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14010-5_8

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-14010-5_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14009-9

  • Online ISBN: 978-3-642-14010-5

  • eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics