Clustering Techniques for Large Database Cleansing

Part of the book series: Network Theory and Applications ((NETA,volume 11))

Abstract

Data cleansing, also called data cleaning or data scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve data quality [35]. It is a common problem in environments where records in a single database contain errors (e.g., misspellings during data entry, missing information, and other invalid data), or where multiple databases must be combined (e.g., in data warehouses, federated database systems, and global web-based information systems).

References

  1. R. Agrawal and H. V. Jagadish. Multiprocessor transitive closure algorithms. In Proc. Int’l Symp. on Databases in Parallel and Distributed Systems, pages 56–66, December 1988.

  2. M. A. Bickel. Automatic correction to misspelled names: a fourth-generation language approach. Communications of the ACM, 30(3):224–228, 1987.

  3. D. Bitton and D. J. DeWitt. Duplicate record elimination in large data files. ACM Transactions on Database Systems, 8(2):255–265, 1983.

  4. S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26(1), 1997.

  5. M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from database perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6):866–883, 1996.

  6. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 1998.

  7. T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1990.

  8. D. J. DeWitt, J. F. Naughton, and D. A. Schneider. An evaluation of non-equijoin algorithms. In Proc. 17th Int’l. Conf. on Very Large Databases, pages 443–452, Barcelona, Spain, December 1991.

  9. C. L. Forgy. OPS5 user’s manual. Technical Report CMU-CS-81-135, Carnegie Mellon University, July 1981.

  10. E. J. Friedman-Hill. Jess, the Java expert system shell, 1999. Available from http://herzberg.ca.sandia.gov/jess.

  11. H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. A. Saita. Declarative data cleaning: Language, model, and algorithms. In Proc. 27th Int’l. Conf. on Very Large Databases, pages 371–380, Roma, Italy, 2001.

  12. S. Ghandeharizadeh. Physical Database Design in Multiprocessor Database Systems. PhD thesis, Department of Computer Science, University of Wisconsin - Madison, 1990.

  13. L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In Proc. 27th Int’l. Conf. on Very Large Databases, pages 491–500, Roma, Italy, 2001.

  14. P. Gulutzan and T. Pelzer. SQL-99 Complete, Really. R&D Books, 1999.

  15. D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.

  16. M. Hernandez. A generalization of band joins and the merge/purge problem. Technical Report CULS-005–1995, Columbia University, February 1996.

  17. M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 127–138, May 1995.

  18. M. Hernandez and S. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9–37, 1998.

  19. D. S. Hirschberg. Algorithms for the longest common subsequence problem. Journal of the ACM, 24: 664–675, 1977.

  20. M. Jarke, M. Lenzerini, Y. Vassiliou, and P. Vassiliadis. Fundamentals of Data Warehouses. Springer, 2000.

  21. R. Kimball. Dealing with dirty data. DBMS online, September 1996. Available from http://www.dbmsmag.com/9609d14.html

  22. K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377–439, 1992.

  23. L. V. S. Lakshmanan, F. Sadri, and I. N. Subramanian. SchemaSQL - a language for interoperability in relational multi-database systems. In Proc. 22nd Int’l. Conf. on Very Large Databases, pages 239–250, Mumbai, 1996.

  24. M. L. Lee, T. W. Ling, and W. L. Low. Intelliclean: A knowledge-based intelligent data cleaner. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 290–294, 2000.

  25. M. L. Lee, T. W. Ling, and W. L. Low. A knowledge-based framework for intelligent data cleansing. Information System Journal - Special Issue on Data Extraction and Cleaning, 2001.

  26. M. L. Lee, H. J. Lu, T. W. Ling, and Y. T. Ko. Cleansing data for mining and warehousing. In Proceedings of the 10th International Conference on Database and Expert Systems Applications (DEXA), pages 751–760, 1999.

  27. Infoshare Limited. Best value guide to data standardizing. InfoDB, July 1998. Available from http://www.infoshare.ltd.uk

  28. A. Marzal and E. Vidal. Computation of normalized edit distances and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):926–932, 1993.

  29. DataCleanser DataBlade Module. http://www.informix.com/informix/products/options/udo/datablade/dbmodule/eddl.htm

  30. A. E. Monge. Matching algorithm within a duplicate detection system. In IEEE Data Engineering Bulletin, volume 23(4), December 2000.

  31. A. E. Monge and C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In Proceedings of the ACM-SIGMOD Workshop on Research Issues on Knowledge Discovery and Data Mining, Tucson, AZ, 1997.

  32. L. Moss. Data cleansing: A dichotomy of data warehousing? DM Review, February 1998. Available from http://www.dmreview.com/editorial/dmreview/print.action.cfm?EdID=828.

  33. Dictionary of Algorithms and Data Structures. http://www.nist.gov/dads/.

  34. X. Y. Qi, S. Y. Sung, C. Lu, Z. Li, and P. Sun. Field similarity algorithm. In Sixth International Conference on Computer Science and Informatics, pages 432–436, Durham, NC, USA, March 2002.

  35. E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. In IEEE Data Engineering Bulletin, volume 23(4), December 2000.

  36. V. Raman and J. M. Hellerstein. Potter’s wheel: An interactive data cleaning system. In Proc. 27th Int’l. Conf. on Very Large Databases, pages 381–390, Rome, 2001.

  37. G. Riley. A tool for building expert systems, 2002. Available from http://www.ghg.net/clips/CLIPS.html.

  38. A. Silberschatz, M. Stonebraker, and J. Ullman. Database research: Achievements and opportunities into the 21st century. SIGMOD Record (ACM Special Interest Group on Management of Data), 25(1):52, 1996.

  39. T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.

  40. R. E. Tarjan. Efficiency of a good but not linear set union algorithm. Journal of the ACM, 22(2):215–225, 1975.

  41. E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14 (3): 249–260, 1995.

  42. R. Wagner and M. Fischer. The string to string correction problem. Journal of the ACM, 21(1):168–173, 1974.

  43. R. Y. Wang, M. P. Reddy, and H. B. Kon. Towards quality data: An attribute-based approach. Decision Support Systems, 13, 1995.

Copyright information

© 2004 Kluwer Academic Publishers

About this chapter

Cite this chapter

Sung, S.Y., Li, Z., Ling, T.W. (2004). Clustering Techniques for Large Database Cleansing. In: Clustering and Information Retrieval. Network Theory and Applications, vol 11. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-0227-8_8

  • DOI: https://doi.org/10.1007/978-1-4613-0227-8_8

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4613-7949-2

  • Online ISBN: 978-1-4613-0227-8

  • eBook Packages: Springer Book Archive
