Clustering Techniques for Large Database Cleansing

Sung, Sam Y.; Li, Zhao; Ling, Tok W.

doi:10.1007/978-1-4613-0227-8_8


Part of the book series: Network Theory and Applications (NETA, volume 11)

Abstract

Data cleansing, also called data cleaning or data scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve data quality [35]. It is a common problem in environments where a single database contains erroneous records (e.g., due to misspellings during data entry, missing information, or other invalid data), or where multiple databases must be combined (e.g., in data warehouses, federated database systems, and global web-based information systems).

References

[1] R. Agrawal and H. V. Jagadish. Multiprocessor transitive closure algorithms. In Proc. Int’l Symp. on Databases in Parallel and Distributed Systems, pages 56–66, December 1988.
[2] M. A. Bickel. Automatic correction to misspelled names: a fourth-generation language approach. Communications of the ACM, 30(3):224–228, 1987.
[3] D. Bitton and D. J. DeWitt. Duplicate record elimination in large data files. ACM Transactions on Database Systems, 8(2):255–265, 1983.
[4] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26(1), 1997.
[5] M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from database perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6):866–883, 1996.
[6] W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 1998.
[7] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1990.
[8] D. J. DeWitt, J. F. Naughton, and D. A. Schneider. An evaluation of non-equijoin algorithms. In Proc. 17th Int’l. Conf. on Very Large Databases, pages 443–452, Barcelona, Spain, December 1991.
[9] C. L. Forgy. OPS5 user’s manual. Technical Report CMU-CS-81-135, Carnegie Mellon University, July 1981.
[10] E. J. Friedman-Hill. Jess, the Java expert system shell, 1999. Available from http://herzberg.ca.sandia.gov/jess.
[11] H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. A. Saita. Declarative data cleaning: Language, model, and algorithms. In Proc. 27th Int’l. Conf. on Very Large Databases, pages 371–380, Roma, Italy, 2001.
[12] S. Ghandeharizadeh. Physical Database Design in Multiprocessor Database Systems. PhD thesis, Department of Computer Science, University of Wisconsin - Madison, 1990.
[13] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In Proc. 27th Int’l. Conf. on Very Large Databases, pages 491–500, Roma, Italy, 2001.
[14] P. Gulutzan and T. Pelzer. SQL-99 Complete, Really. R&D Books, 1999.
[15] D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
[16] M. Hernandez. A generalization of band joins and the merge/purge problem. Technical Report CULS-005-1995, Columbia University, February 1996.
[17] M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 127–138, May 1995.
[18] M. Hernandez and S. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9–37, 1998.
[19] D. S. Hirschberg. Algorithms for the longest common subsequence problem. Journal of the ACM, 24:664–675, 1977.
[20] M. L. Jarke, M. Vassiliou, and P. Vassiliadis. Fundamentals of data warehouses. Springer, 2000.
[21] R. Kimball. Dealing with dirty data. DBMS online, September 1996. Available from http://www.dbmsmag.com/9609d14.html
[22] K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377–439, 1992.
[23] L. V. S. Lakshmanan, F. Sadri, and I. N. Subramanian. SchemaSQL - a language for interoperability in relational multi-database systems. In Proc. 22nd Int’l. Conf. on Very Large Databases, pages 239–250, Mumbai, 1996.
[24] M. L. Lee, T. W. Ling, and W. L. Low. Intelliclean: A knowledge-based intelligent data cleaner. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 290–294, 2000.
[25] M. L. Lee, T. W. Ling, and W. L. Low. A knowledge-based framework for intelligent data cleansing. Information System Journal - Special Issue on Data Extraction and Cleaning, 2001.
[26] M. L. Lee, H. J. Lu, T. W. Ling, and Y. T. Ko. Cleansing data for mining and warehousing. In Proceedings of the 10th International Conference on Database and Expert Systems Applications (DEXA), pages 751–760, 1999.
[27] Infoshare Limited. Best value guide to data standardizing. InfoDB, July 1998. Available from http://www.infoshare.ltd.uk
[28] A. Marzal and E. Vidal. Computation of normalized edit distances and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):926–932, 1993.
[29] DataCleanser DataBlade Module. http://www.informix.com/informix/products/options/udo/datablade/dbmodule/eddl.htm
[30] A. E. Monge. Matching algorithm within a duplicate detection system. IEEE Data Engineering Bulletin, 23(4), December 2000.
[31] A. E. Monge and C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In Proceedings of the ACM-SIGMOD Workshop on Research Issues on Knowledge Discovery and Data Mining, Tucson, AZ, 1997.
[32] L. Moss. Data cleansing: A dichotomy of data warehousing? DM Review, February 1998. Available from http://www.dmreview.com/editorial/dmreview/print.action.cfm?EdID=828.
[33] Dictionary of Algorithms and Data Structures. http://www.nist.gov/dads/.
[34] X. Y. Qi, S. Y. Sung, C. Lu, Z. Li, and P. Sun. Field similarity algorithm. In Sixth International Conference on Computer Science and Informatics, pages 432–436, Durham, NC, USA, March 2002.
[35] E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4), December 2000.
[36] V. Raman and J. M. Hellerstein. Potter’s wheel: An interactive data cleaning system. In Proc. 27th Int’l. Conf. on Very Large Databases, pages 381–390, Rome, 2001.
[37] G. Riley. A tool for building expert systems, 2002. Available from http://www.ghg.net/clips/CLIPS.html.
[38] A. Silberschatz, M. Stonebraker, and J. Ullman. Database research: Achievements and opportunities into the 21st century. SIGMOD Record (ACM Special Interest Group on Management of Data), 25(1):52, 1996.
[39] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.
[40] R. E. Tarjan. Efficiency of a good but not linear set union algorithm. Journal of the ACM, 22(2):215–225, 1975.
[41] E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995.
[42] R. Wagner and M. Fischer. The string to string correction problem. Journal of the ACM, 21(1):168–173, 1974.
[43] R. Y. Wang, M. P. Reddy, and H. B. Kon. Towards quality data: An attribute-based approach. Decision Support Systems, 13, 1995.

Author information

Authors and Affiliations

Department of Computer Science, National University of Singapore, Singapore, 119260
Sam Y. Sung, Zhao Li & Tok W. Ling

About this chapter

Cite this chapter

Sung, S.Y., Li, Z., Ling, T.W. (2004). Clustering Techniques for Large Database Cleansing. In: Clustering and Information Retrieval. Network Theory and Applications, vol 11. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-0227-8_8

DOI: https://doi.org/10.1007/978-1-4613-0227-8_8
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-7949-2
Online ISBN: 978-1-4613-0227-8
eBook Packages: Springer Book Archive