Bayesian networks for supporting query processing over incomplete autonomous databases

Journal of Intelligent Information Systems

Abstract

As the information available to naïve users through autonomous data sources continues to increase, mediators become important to ensure that the wealth of information available is tapped effectively. A key challenge that these information mediators need to handle is the varying levels of incompleteness in the underlying databases in terms of missing attribute values. Existing approaches such as QPIAD mine and use Approximate Functional Dependencies (AFDs) to predict and retrieve relevant incomplete tuples. These approaches make independence assumptions about missing values, which critically hobbles their performance when tuples contain missing values for multiple correlated attributes. In this paper, we present a principled probabilistic alternative that views an incomplete tuple as defining a distribution over the complete tuples that it stands for. We learn this distribution in terms of Bayesian networks. Our approach involves mining ("learning") Bayesian networks from a sample of the database and using them for both imputation (predicting a missing value) and query rewriting (retrieving relevant results that are incomplete on the query-constrained attributes, when the data sources are autonomous). We present empirical studies to demonstrate that (i) at higher levels of incompleteness, when multiple attribute values are missing, Bayesian networks provide significantly higher classification accuracy, and (ii) the relevant possible answers retrieved by queries reformulated using Bayesian networks have higher precision and recall than those retrieved using AFDs, while keeping query processing costs manageable.
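The idea that an incomplete tuple defines a distribution over the complete tuples it stands for can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the three attributes (make, model, body), the toy sample, and the hand-fixed network structure Model → Make, Model → Body. The paper learns the structure from data, whereas this sketch only estimates the conditional probability tables by counting and then enumerates completions of a tuple with two correlated attributes missing.

```python
from collections import Counter
from itertools import product

# Hypothetical complete sample of (make, model, body) tuples.
SAMPLE = [
    ("Honda", "Civic", "Sedan"),
    ("Honda", "Civic", "Sedan"),
    ("Honda", "Civic", "Coupe"),
    ("Honda", "CR-V", "SUV"),
    ("Toyota", "Camry", "Sedan"),
    ("Toyota", "Camry", "Sedan"),
    ("Toyota", "RAV4", "SUV"),
    ("Toyota", "RAV4", "SUV"),
]


def fit_cpts(rows):
    """Estimate P(model), P(make | model), P(body | model) by counting.

    The structure Model -> Make, Model -> Body is assumed, not learned.
    """
    n = len(rows)
    model_counts = Counter(model for _, model, _ in rows)
    make_counts = Counter((model, make) for make, model, _ in rows)
    body_counts = Counter((model, body) for _, model, body in rows)
    p_model = {m: c / n for m, c in model_counts.items()}
    p_make = {(m, mk): c / model_counts[m] for (m, mk), c in make_counts.items()}
    p_body = {(m, b): c / model_counts[m] for (m, b), c in body_counts.items()}
    return p_model, p_make, p_body


def completion_distribution(incomplete, rows):
    """Return P(complete tuple | observed values) for a tuple with None gaps."""
    p_model, p_make, p_body = fit_cpts(rows)
    # Candidate values per attribute: the known value, or the observed domain.
    domains = [sorted({row[i] for row in rows}) for i in range(3)]
    candidates = [
        dom if value is None else [value]
        for value, dom in zip(incomplete, domains)
    ]
    scores = {}
    for make, model, body in product(*candidates):
        # Chain rule for the assumed structure.
        score = (
            p_model.get(model, 0.0)
            * p_make.get((model, make), 0.0)
            * p_body.get((model, body), 0.0)
        )
        if score > 0:
            scores[(make, model, body)] = score
    total = sum(scores.values())
    if total == 0:
        raise ValueError("observed values are inconsistent with the sample")
    return {t: s / total for t, s in scores.items()}


# Make and model are both missing; only the body style is known.
dist = completion_distribution((None, None, "SUV"), SAMPLE)
for completed, prob in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(completed, round(prob, 3))
```

On this toy sample the two missing attributes are filled in jointly: the only completions with nonzero probability are (Toyota, RAV4, SUV) at 2/3 and (Honda, CR-V, SUV) at 1/3. Predicting make and model independently of each other could instead produce an inconsistent pair such as a Honda RAV4.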

Notes

  1. The actual implementation of QPIAD uses a variant of the highest-confidence AFD for some of the attributes. For details, we refer the reader to Wolf et al. (2009).

  2. In this prototype, we manually transferred the output of the BANJO module to the BNT module. In future systems, we will integrate them programmatically.

References

  • Batista, G., & Monard, M. (2002). A study of k-nearest neighbour as an imputation method. In Soft computing systems: design, management and applications (pp. 251–260). Santiago, Chile.

  • Bishop, C., et al. (2006). Pattern recognition and machine learning (Vol. 4). Springer, New York.

  • Cars.com (2013). http://www.cars.com. Accessed 1 Feb 2013.

  • CBioC (2013). http://cbioc.eas.asu.edu/. Accessed 1 Feb 2013.

  • Cooper, G. (1990). The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2), 393–405.

  • Dempster, A., Laird, N., Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 1–38.

  • Fernández, A., Rumí, R., Salmerón, A. (2012). Answering queries in hybrid Bayesian networks using importance sampling. Decision Support Systems, 53(3), 580–590.

  • Frank, A., & Asuncion, A. (2010). UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 1 Feb 2013.

  • Geiger, D., Verma, T., Pearl, J. (1990). Identifying independence in Bayesian networks. Networks, 20(5), 507–534.

  • Gupta, R., & Sarawagi, S. (2006). Creating probabilistic databases from information extraction models. In VLDB (pp. 965–976).

  • Hartemink, A., et al. (2005). Banjo: Bayesian network inference with Java objects. http://www.cs.duke.edu/~amink/software/banjo/. Accessed 1 Feb 2013.

  • Heckerman, D. (1992). The certainty-factor model (2nd edn). Encyclopedia of Artificial Intelligence.

  • Heckerman, D., Geiger, D., Chickering, D.M. (1995). Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 20(3), 197–243.

  • Heitjan, D.F., & Basu, S. (1996). Distinguishing missing at random and missing completely at random. The American Statistician, 50(3), 207–213.

  • Jensen, F., Olesen, K., Andersen, S. (1990). An algebra of Bayesian belief universes for knowledge-based systems. Networks, 20(5), 637–659.

  • Jensen, F.V., & Nielsen, T.D. (2007). Bayesian networks and decision graphs. Springer.

  • Khatri, H. (2006). Query processing over incomplete autonomous web databases. Master’s thesis, Arizona State University, Tempe, USA.

  • Manning, C.D., Raghavan, P., Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.

  • Minka, T., Winn, J., Guiver, J., Knowles, D. (2010). Infer.NET 2.4. http://research.microsoft.com/infernet. Microsoft Research Cambridge. Accessed 1 Feb 2013.

  • Minka, T.P. (2001). Expectation propagation for approximate Bayesian inference. In UAI (pp. 362–369).

  • Murphy, K., et al. (2001). The Bayes net toolbox for Matlab. Computing Science and Statistics, 33(2), 1024–1034.

  • Muslea, I., & Lee, T. (2005). Online query relaxation via Bayesian causal structures discovery. In Proceedings of the national conference on artificial intelligence (Vol. 20, p. 831). Menlo Park, CA/Cambridge, MA, London: AAAI Press/MIT Press.

  • Nambiar, U., & Kambhampati, S. (2006). Answering imprecise queries over autonomous web databases. In Proceedings of the 22nd International Conference on Data Engineering (ICDE) (p. 45). IEEE.

  • Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann.

  • Ramoni, M., & Sebastiani, P. (1997). Learning Bayesian networks from incomplete databases. In Proceedings of the 13th conference on uncertainty in artificial intelligence (pp. 401–408). Morgan Kaufmann Publishers Inc.

  • Ramoni, M., & Sebastiani, P. (2001). Robust learning with missing data. Machine Learning, 45(2), 147–170.

  • Romero, V., & Salmerón, A. (2004). Multivariate imputation of qualitative missing data using Bayesian networks. In Soft methodology and random information systems (pp. 605–612). Springer.

  • Russell, S.J., & Norvig, P. (2010). Artificial intelligence—a modern approach. Pearson Education.

  • Shortliffe, E. (1976). Computer-based medical consultations: MYCIN (Vol. 388). Elsevier, New York.

  • Wolf, G., Kalavagattu, A., Khatri, H., Balakrishnan, R., Chokshi, B., Fan, J., Chen, Y., Kambhampati, S. (2009). Query processing over incomplete autonomous databases: query rewriting using learned data dependencies. Very Large Data Bases Journal, 18(5), 1167–1190.

  • Wolf, G., Khatri, H., Chokshi, B., Fan, J., Chen, Y., Kambhampati, S. (2007). Query processing over incomplete autonomous databases. In Proceedings of the 33rd international conference on very large data bases (pp. 651–662). VLDB Endowment.

  • Wu, C., Wun, C., Chou, H. (2004). Using association rules for completing missing data. In Hybrid Intelligent Systems (HIS) (pp. 236–241). IEEE.

Author information

Corresponding author

Correspondence to Sushovan De.

Additional information

This research is supported by ONR grant N000140910032 and two Google research awards.

About this article

Cite this article

Raghunathan, R., De, S. & Kambhampati, S. Bayesian networks for supporting query processing over incomplete autonomous databases. J Intell Inf Syst 42, 595–618 (2014). https://doi.org/10.1007/s10844-013-0277-0

  • DOI: https://doi.org/10.1007/s10844-013-0277-0
