Abstract
We study the evaluation of positive conjunctive queries with Boolean aggregate tests (similar to HAVING queries in SQL) on probabilistic databases. Our motivation is to handle aggregate queries over imprecise data resulting from information integration or information extraction. More precisely, we study conjunctive queries with predicate aggregates using MIN, MAX, COUNT, SUM, AVG or COUNT(DISTINCT) on probabilistic databases. Computing the precise output probabilities for positive conjunctive queries (without HAVING) is #\({\mathcal {P}}\)-hard in general, but is in \({\mathcal {P}}\) for a restricted class of queries called safe queries. Further, for queries without self-joins, either a query is safe or its data complexity is #\({\mathcal {P}}\)-hard, which shows that safe queries exactly capture the tractable queries without self-joins. In this paper, for each aggregate above, we find a class of queries that exactly captures efficient evaluation for HAVING queries without self-joins. Our algorithms use a novel technique to compute the marginal distributions of elements in a semiring, which may be of independent interest.
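To make the notion of a Boolean aggregate test concrete, the following is a minimal sketch of the simplest possible case. It is an illustration only, not the algorithm of the paper. It assumes a single relation whose tuples are present independently with known probabilities, and a test of the form HAVING COUNT(*) >= k. The function names `count_distribution` and `prob_count_at_least` are hypothetical. The sketch builds the marginal distribution of the count one tuple at a time and then sums the mass that satisfies the test.

```python
from typing import List


def count_distribution(probs: List[float]) -> List[float]:
    """Return dist, where dist[c] is the probability that exactly c tuples are present.

    Assumption (not from the paper): tuples are independent events and
    probs[i] is the marginal probability that tuple i is present.
    """
    dist = [1.0]  # with no tuples, the count is 0 with probability 1
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for c, mass in enumerate(dist):
            nxt[c] += mass * (1.0 - p)  # tuple absent: count unchanged
            nxt[c + 1] += mass * p      # tuple present: count grows by one
        dist = nxt
    return dist


def prob_count_at_least(probs: List[float], k: int) -> float:
    """Probability that the test COUNT(*) >= k holds, for k >= 0."""
    return sum(count_distribution(probs)[k:])


if __name__ == "__main__":
    # Three uncertain tuples; how likely is it that at least two are present?
    print(prob_count_at_least([0.9, 0.5, 0.2], 2))
```

The running time is quadratic in the number of tuples, because the distribution has at most one entry per possible count. Queries with joins, and the other aggregates listed above, need considerably more care than this single-relation case.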
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ré, C., Suciu, D. (2007). Efficient Evaluation of HAVING Queries on a Probabilistic Database. In: Arenas, M., Schwartzbach, M.I. (eds) Database Programming Languages. DBPL 2007. Lecture Notes in Computer Science, vol 4797. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75987-4_13
DOI: https://doi.org/10.1007/978-3-540-75987-4_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75986-7
Online ISBN: 978-3-540-75987-4
eBook Packages: Computer Science, Computer Science (R0)