Abstract
As the information available to naïve users through autonomous data sources continues to increase, mediators become important to ensure that the wealth of information available is tapped effectively. A key challenge that these information mediators need to handle is the varying levels of incompleteness in the underlying databases in terms of missing attribute values. Existing approaches such as QPIAD aim to mine and use Approximate Functional Dependencies (AFDs) to predict and retrieve relevant incomplete tuples. These approaches make independence assumptions about missing values, which critically hobbles their performance when tuples contain missing values for multiple correlated attributes. In this paper, we present a principled probabilistic alternative that views an incomplete tuple as defining a distribution over the complete tuples that it stands for. We learn this distribution in terms of Bayesian networks. Our approach involves mining (“learning”) Bayesian networks from a sample of the database and using them for both imputation (predicting a missing value) and query rewriting (retrieving relevant results with incompleteness on the query-constrained attributes, when the data sources are autonomous). We present empirical studies to demonstrate that (i) at higher levels of incompleteness, when multiple attribute values are missing, Bayesian networks provide significantly higher classification accuracy and (ii) the relevant possible answers retrieved by the queries reformulated using Bayesian networks provide higher precision and recall than AFDs while keeping query processing costs manageable.
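The idea of treating an incomplete tuple as a distribution over its complete versions can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation. It assumes a toy three-attribute network (Make → Model → Body) whose structure and probabilities are invented, and it uses brute-force enumeration in place of a learned network and an inference engine. The names `completions` and `relevance` are hypothetical.

```python
from itertools import product

# Hypothetical toy network: Make -> Model -> Body.
# All probabilities below are invented for illustration only.
P_MAKE = {"Honda": 0.6, "BMW": 0.4}
P_MODEL = {  # P(Model | Make)
    "Honda": {"Civic": 0.7, "Accord": 0.3},
    "BMW": {"Z4": 0.5, "X5": 0.5},
}
P_BODY = {  # P(Body | Model)
    "Civic": {"Sedan": 0.8, "Coupe": 0.2},
    "Accord": {"Sedan": 0.9, "Coupe": 0.1},
    "Z4": {"Convt": 0.9, "Coupe": 0.1},
    "X5": {"SUV": 1.0},
}

ATTRS = ("make", "model", "body")
DOMAINS = {
    "make": sorted(P_MAKE),
    "model": sorted(m for d in P_MODEL.values() for m in d),
    "body": sorted({b for d in P_BODY.values() for b in d}),
}


def joint(make, model, body):
    """Joint probability of a complete tuple under the chain factorization."""
    return (P_MAKE.get(make, 0.0)
            * P_MODEL.get(make, {}).get(model, 0.0)
            * P_BODY.get(model, {}).get(body, 0.0))


def completions(tuple_):
    """Map each possible complete tuple to its posterior probability.

    `tuple_` is a dict over ATTRS in which None marks a missing value.
    All missing attributes are filled in jointly, so correlations
    between them are respected.
    """
    missing = [a for a in ATTRS if tuple_[a] is None]
    scores = {}
    for values in product(*(DOMAINS[a] for a in missing)):
        full = dict(tuple_, **dict(zip(missing, values)))
        p = joint(full["make"], full["model"], full["body"])
        if p > 0:
            scores[tuple(full[a] for a in ATTRS)] = p
    total = sum(scores.values())
    if total == 0:
        return {}  # the observed values are impossible under the network
    return {k: v / total for k, v in scores.items()}


def relevance(tuple_, attr, value):
    """Probability that the incomplete tuple satisfies the query attr = value."""
    idx = ATTRS.index(attr)
    return sum(p for k, p in completions(tuple_).items() if k[idx] == value)


# A tuple with two correlated attributes missing.
incomplete = {"make": None, "model": None, "body": "Coupe"}
posterior = completions(incomplete)
print(max(posterior, key=posterior.get))        # most likely joint imputation
print(relevance(incomplete, "model", "Civic"))  # ranking score for Model = Civic
```

Taking the most probable completion corresponds to imputation, while `relevance` gives a score by which incomplete tuples could be ranked as possible answers to a query. Enumeration is exponential in the number of missing attributes, so a realistic system would rely on a proper Bayesian network inference method instead.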
Notes
The actual implementation of QPIAD uses a variant of the highest-confidence AFD for some of the attributes. For details, we refer the reader to Wolf et al. (2009).
In this prototype, we manually transferred the output of the BANJO module to the BNT module. In future systems, we will integrate them programmatically.
References
Batista, G., & Monard, M. (2002). A study of k-nearest neighbour as an imputation method. In Soft computing systems: design, management and applications (pp. 251–260). Santiago, Chile.
Bishop, C., et al. (2006). Pattern recognition and machine learning (Vol. 4). Springer, New York.
Cars.com (2013). http://www.cars.com. Accessed 1 Feb 2013.
CBioC (2013). http://cbioc.eas.asu.edu/. Accessed 1 Feb 2013.
Cooper, G. (1990). The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2), 393–405.
Dempster, A., Laird, N., Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 1–38.
Fernández, A., Rumí, R., Salmerón, A. (2012). Answering queries in hybrid Bayesian networks using importance sampling. Decision Support Systems, 53(3), 580–590.
Frank, A., & Asuncion, A. (2010). UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 1 Feb 2013.
Geiger, D., Verma, T., Pearl, J. (1990). Identifying independence in Bayesian networks. Networks, 20(5), 507–534.
Gupta, R., & Sarawagi, S. (2006). Creating probabilistic databases from information extraction models. In VLDB (pp. 965–976).
Hartemink, A., et al. (2005). Banjo: Bayesian network inference with Java objects. http://www.cs.duke.edu/~amink/software/banjo/. Accessed 1 Feb 2013.
Heckerman, D. (1992). The certainty-factor model (2nd edn). Encyclopedia of Artificial Intelligence.
Heckerman, D., Geiger, D., Chickering, D.M. (1995). Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 20(3), 197–243.
Heitjan, D.F., & Basu, S. (1996). Distinguishing missing at random and missing completely at random. The American Statistician, 50(3), 207–213.
Jensen, F., Olesen, K., Andersen, S. (2006). An algebra of Bayesian belief universes for knowledge-based systems. Networks, 20(5), 637–659.
Jensen, F.V., & Nielsen, T.D. (2007). Bayesian networks and decision graphs. Springer.
Khatri, H. (2006). Query processing over incomplete autonomous web databases. Master’s thesis, Arizona State University, Tempe, USA.
Manning, C.D., Raghavan, P., Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.
Minka, T., Winn, J., Guiver, J., Knowles, D. (2010). Infer.NET 2.4. http://research.microsoft.com/infernet. Microsoft Research Cambridge. Accessed 1 Feb 2013.
Minka, T.P. (2001). Expectation propagation for approximate Bayesian inference. In UAI (pp. 362–369).
Murphy, K., et al. (2001). The Bayes net toolbox for Matlab. Computing Science and Statistics, 33(2), 1024–1034.
Muslea, I., & Lee, T. (2005). Online query relaxation via Bayesian causal structures discovery. In Proceedings of the national conference on artificial intelligence (Vol. 20, p. 831). Menlo Park, CA/Cambridge, MA, London: AAAI Press/MIT Press.
Nambiar, U., & Kambhampati, S. (2006). Answering imprecise queries over autonomous web databases. In Proceedings of the 22nd International Conference on Data Engineering (ICDE) (p. 45). IEEE.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann.
Ramoni, M., & Sebastiani, P. (1997). Learning Bayesian networks from incomplete databases. In Proceedings of the 13th conference on uncertainty in artificial intelligence (pp. 401–408). Morgan Kaufmann Publishers Inc.
Ramoni, M., & Sebastiani, P. (2001). Robust learning with missing data. Machine Learning, 45(2), 147–170.
Romero, V., & Salmerón, A. (2004). Multivariate imputation of qualitative missing data using Bayesian networks. In Soft methodology and random information systems (pp. 605–612). Springer.
Russell, S.J., & Norvig, P. (2010). Artificial intelligence—a modern approach. Pearson Education.
Shortliffe, E. (1976). Computer-based medical consultations: MYCIN (Vol. 388). Elsevier, New York.
Wolf, G., Kalavagattu, A., Khatri, H., Balakrishnan, R., Chokshi, B., Fan, J., Chen, Y., Kambhampati, S. (2009). Query processing over incomplete autonomous databases: query rewriting using learned data dependencies. Very Large Data Bases Journal, 18(5), 1167–1190.
Wolf, G., Khatri, H., Chokshi, B., Fan, J., Chen, Y., Kambhampati, S. (2007). Query processing over incomplete autonomous databases. In Proceedings of the 33rd international conference on very large data bases (pp. 651–662). VLDB Endowment.
Wu, C., Wun, C., Chou, H. (2004). Using association rules for completing missing data. In Hybrid Intelligent Systems (HIS) (pp. 236–241). IEEE.
Additional information
This research is supported by ONR grant N000140910032 and two Google research awards.
Cite this article
Raghunathan, R., De, S. & Kambhampati, S. Bayesian networks for supporting query processing over incomplete autonomous databases. J Intell Inf Syst 42, 595–618 (2014). https://doi.org/10.1007/s10844-013-0277-0
DOI: https://doi.org/10.1007/s10844-013-0277-0