Bayesian networks for supporting query processing over incomplete autonomous databases
As the information available to naïve users through autonomous data sources continues to increase, mediators become important to ensure that the wealth of information available is tapped effectively. A key challenge that these information mediators need to handle is the varying levels of incompleteness in the underlying databases in terms of missing attribute values. Existing approaches such as QPIAD aim to mine and use Approximate Functional Dependencies (AFDs) to predict and retrieve relevant incomplete tuples. These approaches make independence assumptions about missing values, which critically hobbles their performance when tuples contain missing values for multiple correlated attributes. In this paper, we present a principled probabilistic alternative that views an incomplete tuple as defining a distribution over the complete tuples that it stands for. We learn this distribution in terms of Bayesian networks. Our approach involves mining (“learning”) Bayesian networks from a sample of the database and using them for both imputation (predicting a missing value) and query rewriting (retrieving relevant results that are incomplete on the query-constrained attributes, when the data sources are autonomous). We present empirical studies to demonstrate that (i) at higher levels of incompleteness, when multiple attribute values are missing, Bayesian networks provide significantly higher classification accuracy than AFDs and (ii) the relevant possible answers retrieved by queries reformulated using Bayesian networks have higher precision and recall than those retrieved using AFDs, while keeping query processing costs manageable.
Keywords: Data cleaning, Bayesian networks, Query rewriting, Autonomous database
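The central idea of the abstract, that an incomplete tuple defines a distribution over the complete tuples it stands for, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-attribute car schema, the network structure Make → Model → Body, the probability values, and the function names `completions`, `impute`, and `relevance`. In the paper the network is learned from a database sample rather than written by hand, and this sketch does not reproduce the paper's query rewriting procedure; it only shows how a single network can score several correlated missing values jointly.

```python
from itertools import product

# Hypothetical toy network Make -> Model -> Body.
# All attribute values and probabilities are invented for illustration.
P_MAKE = {"Honda": 0.6, "BMW": 0.4}
P_MODEL_GIVEN_MAKE = {
    "Honda": {"Civic": 0.7, "Accord": 0.3},
    "BMW": {"Z4": 0.5, "X5": 0.5},
}
P_BODY_GIVEN_MODEL = {
    "Civic": {"Sedan": 0.8, "Coupe": 0.2},
    "Accord": {"Sedan": 0.9, "Coupe": 0.1},
    "Z4": {"Convertible": 0.9, "Coupe": 0.1},
    "X5": {"SUV": 1.0},
}

MAKES = list(P_MAKE)
MODELS = [m for dist in P_MODEL_GIVEN_MAKE.values() for m in dist]
BODIES = sorted({b for dist in P_BODY_GIVEN_MODEL.values() for b in dist})
DOMAINS = [MAKES, MODELS, BODIES]


def joint(make, model, body):
    """Joint probability of a complete tuple under the chain factorization."""
    return (
        P_MAKE.get(make, 0.0)
        * P_MODEL_GIVEN_MAKE.get(make, {}).get(model, 0.0)
        * P_BODY_GIVEN_MODEL.get(model, {}).get(body, 0.0)
    )


def completions(incomplete):
    """Distribution over complete tuples consistent with `incomplete`.

    `incomplete` is a (make, model, body) tuple where None marks a missing value.
    Inference is by brute-force enumeration, which is only viable for toy domains.
    """
    candidates = [
        [value] if value is not None else domain
        for value, domain in zip(incomplete, DOMAINS)
    ]
    weights = {c: joint(*c) for c in product(*candidates)}
    total = sum(weights.values())
    if total == 0:
        return {}
    return {c: w / total for c, w in weights.items() if w > 0}


def impute(incomplete):
    """Most probable joint completion of all missing values at once."""
    dist = completions(incomplete)
    return max(dist, key=dist.get)


def relevance(incomplete, attr_index, value):
    """Probability that the tuple satisfies the constraint attribute == value."""
    dist = completions(incomplete)
    return sum(p for c, p in dist.items() if c[attr_index] == value)


if __name__ == "__main__":
    # Two correlated attributes are missing; they are completed jointly.
    print(impute((None, None, "Convertible")))
    # Score an incomplete tuple against a query constraint Body = Sedan.
    print(relevance(("Honda", None, None), 2, "Sedan"))
```

Because the missing values are completed jointly, the imputed Make and Model always agree with each other, which is the situation where per-attribute predictions made under an independence assumption can go wrong. The `relevance` score is one plausible way to rank incomplete tuples as possible answers to a query; it is a stand-in, not the ranking used in the paper.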