Recommended Reading
Ananthakrishna R, Chaudhuri S, Ganti V. Eliminating fuzzy duplicates in data warehouses. In: Proceedings of the 28th International Conference on Very Large Data Bases; 2002.
Aslam JA, Pelekhov K, Rus D. A practical clustering algorithm for static and dynamic information organization. In: Proceedings of the 10th Annual ACM-SIAM Symposium on Discrete Algorithms; 1999.
Bansal N, Blum A, Chawla S. Correlation clustering. Mach Learn. 2004;56(1–3):89–113.
Bhattacharya I, Getoor L. Collective entity resolution in relational data. Q Bull IEEE TC Data Eng. 2006;29(2):4–12.
Bilenko M, Basu S, Mooney RJ. Integrating constraints and metric learning in semi-supervised clustering. In: Proceedings of the 21st International Conference on Machine Learning; 2004.
Bohannon P, Fan W, Flaster M, Rastogi R. A cost-based model and effective heuristic for repairing constraints by value modification. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2005.
Charikar M, Guruswami V, Wirth A. Clustering with qualitative information. J Comput Syst Sci. 2005;71(3):360–83.
Chaudhuri S, Das Sarma A, Ganti V, Kaushik R. Leveraging aggregate constraints for deduplication. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2007.
Dong X, Halevy AY, Madhavan J. Reference reconciliation in complex information spaces. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2005.
Fuxman A, Fazli E, Miller RJ. ConQuer: efficient management of inconsistent databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2005.
Galhardas H, Florescu D, Shasha D, Simon E, Saita C. Declarative data cleaning: language, model, and algorithms. In: Proceedings of the 27th International Conference on Very Large Data Bases; 2001.
Koudas N, Sarawagi S, Srivastava D. Record linkage: similarity measures and algorithms. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2006.
Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2002.
Single linkage clustering. http://en.wikipedia.org/wiki/Single_linkage_clustering
The K-means clustering algorithm. http://mathworld.wolfram.com/K-MeansClusteringAlgorithm.html
Trillium software. http://www.trilliumsoft.com/trilliumsoft.nsf
Toney S. Cleanup and deduplication of an international bibliographic database. Inform Tech Lib. 1992;11(1):25.
Tung AKH, Ng RT, Lakshmanan LVS, Han J. Constraint-based clustering in large databases. In: Proceedings of the 8th International Conference on Database Theory; 2001.
Yancey WE. Bigmatch: a program for extracting probable matches from a large file for record linkage. Statistical Research Report Series RRC2002/01, US Bureau of the Census; 2002.
© 2018 Springer Science+Business Media, LLC, part of Springer Nature
About this entry
Cite this entry
Kaushik, R. (2018). Deduplication in Data Cleaning. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_596
DOI: https://doi.org/10.1007/978-1-4614-8265-9_596
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering