Dependency Discovery in Data Quality

  • Daniele Barone
  • Fabio Stella
  • Carlo Batini
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6051)


A conceptual framework for the automatic discovery of dependencies between data quality dimensions is described. Dependency discovery consists of recovering the dependency structure for a set of data quality dimensions measured on the attributes of a database. This task is accomplished with a data mining approach, by learning a Bayesian Network from a database. The Bayesian Network is used to analyze dependencies between data quality dimensions associated with different attributes. The proposed framework is instantiated on a real-world database. Dependency discovery is presented for the case in which the following data quality dimensions are considered: accuracy, completeness, and consistency. The Bayesian Network model shows how data quality can be improved while satisfying budget constraints.
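
As a rough illustration of what "recovering the dependency structure" can mean, the sketch below ranks pairs of data quality variables by their empirical mutual information. This is not the authors' method: it is a much simpler pairwise stand-in for Bayesian Network structure learning, and the variable names, the 0/1 encoding of quality measurements, and the toy records are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two equal-length sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # Each term is p(x, y) * log2(p(x, y) / (p(x) * p(y))).
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


def rank_dependencies(flags):
    """Rank pairs of quality variables from most to least dependent.

    flags maps a variable name (hypothetical "attribute.dimension" labels)
    to a list of 0/1 quality measurements, one per database record.
    """
    scores = {
        (a, b): mutual_information(flags[a], flags[b])
        for a, b in combinations(sorted(flags), 2)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy measurements: 1 = the record passes the quality check.
flags = {
    "name.completeness": [1, 1, 0, 0, 1, 0, 1, 0],
    "name.accuracy":     [1, 1, 0, 0, 1, 0, 1, 0],
    "city.consistency":  [1, 0, 1, 0, 1, 1, 0, 0],
}

ranking = rank_dependencies(flags)
for pair, score in ranking:
    print(pair, round(score, 3))
```

In this toy data the completeness and accuracy of `name` always move together, so that pair scores highest, while `city.consistency` is statistically independent of both. A Bayesian Network goes further than such pairwise scores by encoding conditional independencies among all variables jointly.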


Data quality · Bayesian networks · Data mining



Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Daniele Barone (1)
  • Fabio Stella (2)
  • Carlo Batini (2)
  1. Department of Computer Science, University of Toronto, Toronto, Canada
  2. Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano, Italy
