Data mining techniques for data cleaning

  • Kalaivany Natarajan
  • Jiuyong Li
  • Andy Koronios


Data quality is a central issue in quality information management, and data quality problems can occur anywhere in an information system. These problems are addressed by data cleaning, a process that detects inaccurate, incomplete or unreasonable data and then improves quality by correcting the detected errors and omissions. Correcting errors and eliminating bad records can be time-consuming and tedious, but it cannot be ignored. Data mining, a set of techniques for discovering interesting information in data, is a key technique for data cleaning: it automatically extracts hidden and intrinsic information from collections of data. Data quality mining is a recent approach that applies data mining techniques to identify and recover from data quality problems in large databases. Several data mining techniques are suitable for data cleaning. In this paper we discuss three major data mining methods for data cleaning, namely functional dependency mining, association rule mining and Bagging SVMs, and we discuss the strengths and weaknesses of each.
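As a rough illustration of how a functional dependency can support data cleaning, the sketch below flags records that contradict a candidate dependency. It is not the method of the paper: the function name `fd_violations`, the attribute names and the toy records are all invented for this example, and the dependency postcode -> suburb is simply assumed to hold.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return the rows that contradict the candidate dependency lhs -> rhs.

    rows: list of dicts (hypothetical records)
    lhs:  list of attribute names on the left-hand side
    rhs:  single attribute name on the right-hand side
    """
    # Group rows by their left-hand-side values.
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].append(row)

    # A group with more than one distinct right-hand-side value
    # breaks the dependency, so all of its rows are suspect.
    violations = []
    for group in groups.values():
        if len({row[rhs] for row in group}) > 1:
            violations.extend(group)
    return violations


# Invented toy data: the third record has a misspelled suburb.
records = [
    {"postcode": "5095", "suburb": "Mawson Lakes"},
    {"postcode": "5095", "suburb": "Mawson Lakes"},
    {"postcode": "5095", "suburb": "Mawson Lake"},
    {"postcode": "4000", "suburb": "Brisbane"},
]

suspects = fd_violations(records, ["postcode"], "suburb")
for row in suspects:
    print(row)
```

The check only reports the conflicting group; deciding which value is correct (for instance by taking the majority value) is a separate repair step that this sketch leaves out.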


Keywords: Data Mining; Data Quality; Association Rule; Functional Dependency; Data Mining Technique
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag 2010

Authors and Affiliations

  • Kalaivany Natarajan (1, 2)
  • Jiuyong Li (1, 2)
  • Andy Koronios (1, 2)

  1. CRC for Integrated Engineering Asset Management, Brisbane, Australia
  2. School of Computer and Information Science, University of South Australia, Mawson Lakes, Australia
