Detecting mis-entered values in large data sets

  • Kalaivany Natarajan
  • Jiuyong Li
  • Andy Koronios
Conference paper


Data is a valuable asset of business organizations and companies. Quality data is essential for business intelligence and intelligent decision-making. Data quality is a central issue in quality information management, and most large business organizations are aware of the need for data quality control. Various mechanisms have been employed to obtain quality data, for example, using electronic forms for data collection. With the popularity of collecting data through electronic forms, mis-entered values have become a major source of dirty values in databases. Mis-entered values can be caused by randomly ticking multiple choices from drop-down selection lists. These dirty values are less conspicuous than traditional data entry errors and misspellings, since mis-entered values are spelled correctly and normally do not cause integrity violations. In this paper, we discuss some data mining methods that are used for detecting mis-entered values in large data sets. We present a framework for detecting mis-entered values using association rules.
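As a rough illustration of the general idea only (the abstract does not spell out the framework, so this is not the authors' method), the sketch below mines simple one-to-one association rules from a small set of form records and flags any record that matches a rule's antecedent but contradicts its consequent. The field names (`sex`, `pregnant`), the toy records, the thresholds, and the helper names `mine_rules` and `flag_suspects` are all hypothetical and chosen for illustration.

```python
from collections import Counter
from itertools import permutations


def mine_rules(records, min_support=0.3, min_confidence=0.85):
    """Mine single-antecedent rules (attr_a=val_a) -> (attr_b=val_b)."""
    n = len(records)
    item_counts = Counter()   # how often each (attribute, value) occurs
    pair_counts = Counter()   # how often an ordered pair of items co-occurs
    for rec in records:
        items = sorted(rec.items())
        item_counts.update(items)
        pair_counts.update(permutations(items, 2))
    rules = []
    for (lhs, rhs), count in pair_counts.items():
        support = count / n
        confidence = count / item_counts[lhs]
        if support >= min_support and confidence >= min_confidence:
            rules.append((lhs, rhs, confidence))
    return rules


def flag_suspects(records, rules):
    """Return (index, lhs, rhs) for records that match lhs but contradict rhs."""
    suspects = []
    for i, rec in enumerate(records):
        for (l_attr, l_val), (r_attr, r_val), _conf in rules:
            if rec.get(l_attr) == l_val and r_attr in rec and rec[r_attr] != r_val:
                suspects.append((i, (l_attr, l_val), (r_attr, r_val)))
    return suspects


# Hypothetical toy data: record 9 has valid spelling and violates no
# integrity constraint, yet it contradicts a strong rule in the data.
records = (
    [{"sex": "male", "pregnant": "no"}] * 9
    + [{"sex": "male", "pregnant": "yes"}]
    + [{"sex": "female", "pregnant": "yes"}] * 5
    + [{"sex": "female", "pregnant": "no"}] * 5
)

rules = mine_rules(records)
suspects = flag_suspects(records, rules)
print(rules)
print(suspects)
```

A flagged record is only a candidate for review: a value that contradicts a high-confidence rule may be a genuine rare case rather than a data entry error, so the thresholds trade missed errors against false alarms.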


Keywords: Cervical Cancer, Association Rule, Data Cleaning, Data Mining Method, Unbiased Sample





Copyright information

© Springer-Verlag 2010

Authors and Affiliations

  • Kalaivany Natarajan (1, 2)
  • Jiuyong Li (1, 2)
  • Andy Koronios (1, 2)
  1. CRC for Integrated Engineering Asset Management, Brisbane, Australia
  2. System Integration and IT, School of Computer and Information Science, University of South Australia, Mawson Lakes, Australia
