A Formal Taxonomy to Improve Data Defect Description

  • João Marcelo Borovina Josko
  • Marcio Katsumi Oikawa
  • João Eduardo Ferreira
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9645)


Data quality assessment outcomes are essential for analytical processes, especially in big data environments. The efficiency and efficacy of such assessment depend on automated solutions, which in turn depend on understanding the problem associated with each data defect. Despite the considerable number of works that describe data defects with regard to accuracy, completeness and consistency, there is significant heterogeneity in terminology, nomenclature, description depth and number of examined defects. To address this gap, this work reports a taxonomy that organizes data defects according to a three-step methodology. The proposed taxonomy improves on the descriptions and coverage of defects found in related works, and also supports certain requirements of data quality assessment, including the design of semi-supervised solutions for data defect detection.


Data defects, Dirty data, Formal taxonomy, Data quality assessment, Relational database, Big data



This work has been supported by CNPq (Brazilian National Research Council) grant number 141647/2011-6 and FAPESP (Sao Paulo State Research Foundation) grant number 2015/01587-0.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • João Marcelo Borovina Josko (1, corresponding author)
  • Marcio Katsumi Oikawa (2)
  • João Eduardo Ferreira (1)
  1. Institute of Mathematics and Statistics, University of São Paulo, Sao Paulo, Brazil
  2. Center of Mathematics, Computing and Cognition, Federal University of ABC, Santo Andre, Brazil
