Abstract
Data cleansing, also called data cleaning or data scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve data quality [35]. It is a common problem in environments where records in a single database contain errors (e.g., due to misspellings during data entry, missing information, and other invalid data), or where multiple databases must be combined (e.g., in data warehouses, federated database systems, and global web-based information systems).
References
R. Agrawal and H. V. Jagadish. Multiprocessor transitive closure algorithms. In Proc. Int’l Symp. On Databases in Parallel and Distributed Systems, pages 56–66, December 1988.
M. A. Bickel. Automatic correction to misspelled names: a fourth-generation language approach. Communications of the ACM, pages 30 (3): 224–228, 1987.
D. Bitton and D. J. DeWitt. Duplicate record elimination in large data files. ACM Transactions on Database Systems, pages 8 (2): 255–265, 1983.
S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. In ACM SIGMOD Record, page 26 (1), 1997.
M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from database perspective. IEEE Transactions on Knowledge and Data Engineering, pages 8 (6): 866–883, 1996.
W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 1998.
T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1990.
D. J. DeWitt, J. F. Naughton, and D. A. Schneider. An evaluation of non-equijoin algorithms. In Proc. 17th Int’l. Conf. on Very Large Databases, pages 443–452, Barcelona, Spain, December 1991.
C. L. Forgy. OPS5 user’s manual. Technical Report CMU-CS-81–135, Carnegie Mellon University, July 1981.
E. J. Friedman-Hill. Jess, the java expert system shell, 1999. Available from http://herzberg.ca.sandia.gov/jess.
H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. A. Saita. Declarative data cleaning: Language, model, and algorithms. In Proc. 27th Int’l. Conf. on Very Large Databases, pages 371–380, Roma, Italy, 2001.
S. Ghandeharizadeh. Physical Database Design in Multiprocessor Database Systems. PhD thesis, Department of Computer Science, University of Wisconsin - Madison, 1990.
L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In Proc. 27th Int’l. Conf. on Very Large Databases, pages 491–500, Roma, Italy, 2001.
P. Gulutzan and T. Pelzer. SQL-99 Complete, Really. R&D Books, 1999.
D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
M. Hernandez. A generalization of band joins and the merge/purge problem. Technical Report CULS-005–1995, Columbia University, February 1996.
M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 127–138, May 1995.
M. Hernandez and S. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, Vol. 2, No. 1: 9–37, 1998.
D. S. Hirschberg. Algorithms for the longest common subsequence problem. Journal of the ACM, 24: 664–675, 1977.
M. L. Jarke, M. Vassiliou, and P. Vassiliadis. Fundamentals of data warehouses. Springer, 2000.
R. Kimball. Dealing with dirty data. DBMS online, September 1996. Available from http://www.dbmsmag.com/9609d14.html
K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, pages 24 (4): 377–439, 1992.
L. V. S. Lakshmanan, F. Sadri, and I. N. Subramanian. SchemaSQL - a language for interoperability in relational multi-database systems. In Proc. 22nd Int’l. Conf. on Very Large Databases, pages 239–250, Mumbai, 1996.
M. L. Lee, T. W. Ling, and W. L. Low. Intelliclean: A knowledge-based intelligent data cleaner. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 290–294, 2000.
M. L. Lee, T. W. Ling, and W. L. Low. A knowledge-based framework for intelligent data cleansing. Information System Journal - Special Issue on Data Extraction and Cleaning, 2001.
M. L. Lee, H. J. Lu, T. W. Ling, and Y. T. Ko. Cleansing data for mining and warehousing. In Proceedings of the 10th International Conference on Database and Expert Systems Applications (DEXA), pages 751–760, 1999.
Infoshare Limited. Best value guide to data standardizing. InfoDB, July 1998. Available from http://www.infoshare.ltd.uk
A. Marzal and E. Vidal. Computation of normalized edit distances and applications. In IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 15(9):926–932, 1993.
DataCleanser DataBlade Module. http://www.informix.com/informix/products/options/udo/datablade/dbmodule/eddl.htm
A. E. Monge. Matching algorithm within a duplicate detection system. In IEEE Data Engineering Bulletin, volume 23(4), December 2000.
A. E. Monge and C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In Proceeding of the ACM-SIGMOD Workshop on Research Issues on Knowledge Discovery and Data Mining, Tucson, AZ, 1997.
L. Moss. Data cleansing: A dichotomy of data warehousing? DM Review, February 1998. Available from http://www.dmreview.com/editorial/dmreview/print.action.cfm?EdID=828.
Dictionary of Algorithms and Data Structures. http://www.nist.gov/dads/.
X. Y. Qi, S. Y. Sung, C. Lu, Z. Li, and P. Sun. Field similarity algorithm. In Sixth International Conference on Computer Science and Informatics, pages 432–436, Durham, NC, USA, March 2002.
E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. In IEEE Data Engineering Bulletin, volume 23(4), December 2000.
V. Raman and J. M. Hellerstein. Potter’s wheel: An interactive data cleaning system. In Proc. 27th Int’l. Conf. on Very Large Databases, pages 381–390, Rome, 2001.
G. Riley. A tool for building expert systems, 2002. Available from http://www.ghg.net/clips/CLIPS.html.
A. Silberschatz, M. Stonebraker, and J. Ullman. Database research: Achievements and opportunities into the 21st century. In SIGMOD Record (ACM Special Interest Group on Management of Data), page 25(1):52, 1996.
T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, pages 147: 195–197, 1981.
R. E. Tarjan. Efficiency of a good but not linear set union algorithm. Journal of the ACM, 22 (2): 215–225, 1975.
E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14 (3): 249–260, 1995.
R. Wagner and M. Fischer. The string to string correction problem. Journal of the ACM, 21 (1): 168–173, 1974.
R. Y. Wang, M. P. Reddy, and H. B. Kon. Towards quality data: An attribute-based approach. Decision Support Systems, 13, 1995.
© 2004 Kluwer Academic Publishers
About this chapter
Cite this chapter
Sung, S.Y., Li, Z., Ling, T.W. (2004). Clustering Techniques for Large Database Cleansing. In: Clustering and Information Retrieval. Network Theory and Applications, vol 11. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-0227-8_8
DOI: https://doi.org/10.1007/978-1-4613-0227-8_8
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-7949-2
Online ISBN: 978-1-4613-0227-8
eBook Packages: Springer Book Archive