Journal of Intelligent Information Systems, Volume 42, Issue 3, pp 595–618

Bayesian networks for supporting query processing over incomplete autonomous databases

  • Rohit Raghunathan
  • Sushovan De
  • Subbarao Kambhampati

Abstract

As the information available to naïve users through autonomous data sources continues to increase, mediators become important to ensure that this wealth of information is tapped effectively. A key challenge that these information mediators need to handle is the varying levels of incompleteness in the underlying databases, in the form of missing attribute values. Existing approaches such as QPIAD mine and use Approximate Functional Dependencies (AFDs) to predict and retrieve relevant incomplete tuples. These approaches make independence assumptions about missing values, which critically hobbles their performance on tuples that have missing values for multiple correlated attributes. In this paper, we present a principled probabilistic alternative that views an incomplete tuple as defining a distribution over the complete tuples that it stands for. We learn this distribution in terms of Bayesian networks. Our approach involves mining ("learning") Bayesian networks from a sample of the database, and using them for both imputation (predicting a missing value) and query rewriting (retrieving relevant results that are incomplete on the query-constrained attributes, when the data sources are autonomous). We present empirical studies to demonstrate that (i) at higher levels of incompleteness, when multiple attribute values are missing, Bayesian networks provide significantly higher classification accuracy than AFDs and (ii) the relevant possible answers retrieved by queries reformulated using Bayesian networks have higher precision and recall than those retrieved using AFDs, while keeping query processing costs manageable.
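
The idea that an incomplete tuple defines a distribution over the complete tuples it stands for can be illustrated with a minimal sketch. This is not the authors' implementation: the three-attribute network (model → make, model → body), all probabilities, and the function names `posterior_over_completions` and `relevance` are hypothetical, and the network is written by hand rather than learned from a database sample. Inference is done by brute-force enumeration, which is only practical for toy domains.

```python
from itertools import product

# Hypothetical hand-built network: model -> make, model -> body.
# Every probability below is invented for illustration only.
P_MODEL = {"Civic": 0.4, "Accord": 0.3, "F150": 0.3}
P_MAKE_GIVEN_MODEL = {
    "Civic": {"Honda": 1.0, "Ford": 0.0},
    "Accord": {"Honda": 1.0, "Ford": 0.0},
    "F150": {"Honda": 0.0, "Ford": 1.0},
}
P_BODY_GIVEN_MODEL = {
    "Civic": {"Sedan": 0.7, "Coupe": 0.3, "Truck": 0.0},
    "Accord": {"Sedan": 0.9, "Coupe": 0.1, "Truck": 0.0},
    "F150": {"Sedan": 0.0, "Coupe": 0.0, "Truck": 1.0},
}
DOMAINS = {
    "model": ["Civic", "Accord", "F150"],
    "make": ["Honda", "Ford"],
    "body": ["Sedan", "Coupe", "Truck"],
}


def joint(t):
    """Joint probability of a complete tuple under the network factorization."""
    m = t["model"]
    return P_MODEL[m] * P_MAKE_GIVEN_MODEL[m][t["make"]] * P_BODY_GIVEN_MODEL[m][t["body"]]


def posterior_over_completions(incomplete):
    """Return (missing attributes, {completion values: posterior probability}).

    Missing attributes are those absent from the tuple or set to None.
    """
    missing = [a for a in DOMAINS if incomplete.get(a) is None]
    weights = {}
    for values in product(*(DOMAINS[a] for a in missing)):
        full = {**incomplete, **dict(zip(missing, values))}
        weights[values] = joint(full)
    total = sum(weights.values())
    if total == 0:
        raise ValueError("observed values have zero probability under the network")
    return missing, {vals: w / total for vals, w in weights.items()}


def relevance(incomplete, attr, value):
    """Probability that the tuple satisfies the constraint attr = value."""
    if incomplete.get(attr) is not None:
        return 1.0 if incomplete[attr] == value else 0.0
    missing, post = posterior_over_completions(incomplete)
    idx = missing.index(attr)
    return sum(p for vals, p in post.items() if vals[idx] == value)


if __name__ == "__main__":
    t = {"make": "Honda", "model": None, "body": None}
    missing, post = posterior_over_completions(t)
    best = max(post, key=post.get)
    print("imputed", dict(zip(missing, best)))
    print("relevance to body = Coupe:", relevance(t, "body", "Coupe"))
```

Because the two missing attributes are completed jointly, the correlation between model and body style is preserved; predicting each missing value independently would discard it. The `relevance` score shows one way incomplete tuples could be ranked as possible answers to a query on an attribute whose value is missing.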

Keywords

Data cleaning · Bayesian networks · Query rewriting · Autonomous database

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Rohit Raghunathan (1)
  • Sushovan De (2)
  • Subbarao Kambhampati (2)
  1. Amazon, Seattle, USA
  2. Computer Science and Engineering, Arizona State University, Tempe, USA
