Encyclopedia of Database Systems

2018 Edition
Editors: Ling Liu, M. Tamer Özsu

Deduplication in Data Cleaning

  • Raghav Kaushik
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_596

Synonyms

Clustering; Merge-purge; Record matching; Reference reconciliation

Definition

The same logical real-world entity often has multiple representations in a relation because of data entry errors, varying conventions, and a variety of other reasons. For example, when Lisa purchases products from SuperMart twice, she might be entered as two different customers, e.g., (Lisa Simpson, Seattle, WA, USA, 98025) and (Simson Lisa, Seattle, WA, United States, 98025). Such duplicated information can cause significant problems for users of the data. For example, it can increase direct mailing costs, because several customers may each be sent multiple catalogs. Duplicates can also cause incorrect results in analytic queries (say, the number of SuperMart customers in Seattle) and lead to erroneous data mining models. Hence, a significant amount of time and effort is spent on detecting and eliminating duplicates.
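To make the Lisa Simpson example concrete, the following is a minimal sketch of how two such records might be flagged as probable duplicates. It is an illustration under stated assumptions, not the method described in this entry: the synonym table `COUNTRY_SYNONYMS`, the helper names, the per-field averaging, and the 0.9 threshold are all hypothetical choices, and the string comparison uses Python's standard `difflib.SequenceMatcher` rather than any similarity function from the literature listed below.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table for country names (an assumption for this sketch).
COUNTRY_SYNONYMS = {"united states": "usa", "us": "usa"}


def normalize(record):
    """Lowercase every field, sort the name tokens, and map country synonyms."""
    name, city, state, country, zip_code = (f.strip().lower() for f in record)
    # Sorting tokens makes "Simson Lisa" and "Lisa Simson" compare equal in order.
    name = " ".join(sorted(name.split()))
    country = COUNTRY_SYNONYMS.get(country, country)
    return (name, city, state, country, zip_code)


def similarity(a, b):
    """Average per-field string similarity in [0, 1] after normalization."""
    na, nb = normalize(a), normalize(b)
    scores = [SequenceMatcher(None, x, y).ratio() for x, y in zip(na, nb)]
    return sum(scores) / len(scores)


def is_probable_duplicate(a, b, threshold=0.9):
    """Flag a pair as a probable duplicate; the threshold is an arbitrary choice."""
    return similarity(a, b) >= threshold


r1 = ("Lisa Simpson", "Seattle", "WA", "USA", "98025")
r2 = ("Simson Lisa", "Seattle", "WA", "United States", "98025")
r3 = ("Bart Simpson", "Springfield", "OR", "USA", "97477")

print(is_probable_duplicate(r1, r2))  # the two Lisa records from the example
print(is_probable_duplicate(r1, r3))  # a different, made-up customer
```

With these assumptions, the two Lisa records score above the threshold because only the misspelled surname differs after normalization, while the made-up third record differs in most fields. A real system would need a principled choice of similarity function and threshold, and a way to avoid comparing every pair of records.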

This problem of detecting and eliminating duplicated...


Recommended Reading

  1. Ananthakrishna R, Chaudhuri S, Ganti V. Eliminating fuzzy duplicates in data warehouses. In: Proceedings of the 28th International Conference on Very Large Data Bases; 2002.
  2. Aslam JA, Pelekhov K, Rus D. A practical clustering algorithm for static and dynamic information organization. In: Proceedings of the 10th Annual ACM-SIAM Symposium on Discrete Algorithms; 1999.
  3. Bansal N, Blum A, Chawla S. Correlation clustering. Mach Learn. 2004;56(1–3):89–113.
  4. Bhattacharya I, Getoor L. Collective entity resolution in relational data. Q Bull IEEE TC Data Eng. 2006;29(2):4–12.
  5. Bilenko M, Basu S, Mooney RJ. Integrating constraints and metric learning in semi-supervised clustering. In: Proceedings of the 21st International Conference on Machine Learning; 2004.
  6. Bohannon P, Fan W, Flaster M, Rastogi R. A cost based model and effective heuristic for repairing constraints by value modification. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2005.
  7. Charikar M, Guruswami V, Wirth A. Clustering with qualitative information. J Comput Syst Sci. 2005;71(3):360–83.
  8. Chaudhuri S, Sarma A, Ganti V, Kaushik R. Leveraging aggregate constraints for deduplication. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2007.
  9. Dong X, Halevy AY, Madhavan J. Reference reconciliation in complex information spaces. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2005.
  10. Fuxman A, Fazli E, Miller RJ. ConQuer: efficient management of inconsistent databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2005.
  11. Galhardas H, Florescu D, Shasha D, Simon E, Saita C. Declarative data cleaning: language, model, and algorithms. In: Proceedings of the 27th International Conference on Very Large Data Bases; 2001.
  12. Koudas N, Sarawagi S, Srivastava D. Record linkage: similarity measures and algorithms. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2006.
  13. Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2002.
  17. Toney S. Cleanup and deduplication of an international bibliographic database. Inform Tech Lib. 1992;11(1):25.
  18. Tung AKH, Ng RT, Lakshmanan LVS, Han J. Constraint-based clustering in large databases. In: Proceedings of the 8th International Conference on Database Theory; 2001.
  19. Yancey WE. Bigmatch: a program for extracting probable matches from a large file for record linkage. Statistical Research Report Series RRC2002/01, US Bureau of the Census; 2002.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Microsoft Research, Redmond, USA

Section editors and affiliations

  • Venkatesh Ganti (1)
  1. Microsoft Research, Microsoft Corporation, Redmond, USA