EntropyDB: a probabilistic approach to approximate query processing

  • Laurel OrrEmail author
  • Magdalena Balazinska
  • Dan Suciu
Special Issue Paper


We present, an interactive data exploration system that uses a probabilistic approach to generate a small, query-able summary of a dataset. Departing from traditional summarization techniques, we use the Principle of Maximum Entropy to generate a probabilistic representation of the data that can be used to give approximate query answers. We develop the theoretical framework and formulation of our probabilistic representation and show how to use it to answer queries. We then present solving techniques, give two critical optimizations to improve preprocessing time and query execution time, and explore methods to reduce query error. Lastly, we experimentally evaluate our work using a 5 GB dataset of flights within the USA and a 210 GB dataset from an astronomy particle simulation. While our current work only supports linear queries, we show that our technique can successfully answer queries faster than sampling while introducing, on average, no more error than sampling and can better distinguish between rare and nonexistent values. We also discuss extensions that can allow for data updates and linear queries over joins.


Database summarization Approximate query processing Principle of maximum entropy Data exploration Probabilistic databases Graphical models 



This work is supported by NSF 1614738 and NSF 1535565. Laurel Orr is supported by the NSF Graduate Research Fellowship.


  1. 1.
    Acharya, S., Gibbons, P.B., Poosala, V.: Congressional samples for approximate answering of group-by queries. In: ACM Sigmod Record, vol. 29, pp. 487–498. ACM (2000)Google Scholar
  2. 2.
    Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: The aqua approximate query answering system. In: ACM Sigmod Record, vol. 28, pp. 574–576. ACM (1999)Google Scholar
  3. 3.
    Agarwal, S., et al.: Blinkdb: queries with bounded errors and bounded response times on very large data. In: Proceedings of EuroSys’13, pp. 29–42 (2013)Google Scholar
  4. 4.
    Applegate, D.A., Calinescu, G., Johnson, D.S., Karloff, H., Ligett, K., Wang, J.: Compressing rectilinear pictures and minimizing access control lists. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete AlgorithmsGoogle Scholar
  5. 5.
    Babcock, B., Chaudhuri, S., Das, G.: Dynamic sample selection for approximate query processing. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 539–550 (2003)Google Scholar
  6. 6.
    Bar-Yossef, Z., Jayram, T., Kumar, R., Sivakumar, D., Trevisan, L.: Counting distinct elements in a data stream. In: International Workshop on Randomization and Approximation Techniques in Computer Science, pp. 1–10. Springer (2002)Google Scholar
  7. 7.
    Behrisch, M., Bach, B., Henry Riche, N., Schreck, T., Fekete, J.-D.: Matrix reordering methods for table and network visualization. In: Computer Graphics Forum, vol. 35, pp. 693–716. Wiley Online Library (2016)Google Scholar
  8. 8.
    Bekker, J., Davis, J., Choi, A., Darwiche, A., Van den Broeck, G.: Tractable learning for complex probability queries. In: Advances in Neural Information Processing Systems, pp. 2242–2250 (2015)Google Scholar
  9. 9.
    Berger, A.L., Pietra, V.J.D., Pietra, S.A.D.: A maximum entropy approach to natural language processing. Comput. Linguist. 22(1), 39–71 (1996)Google Scholar
  10. 10.
    Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015) CrossRefGoogle Scholar
  11. 11.
    Chakrabarti, K., Garofalakis, M., Rastogi, R., Shim, K.: Approximate query processing using wavelets. VLDB J. Int. J. Very Large Data Bases 10(2–3), 199–223 (2001)zbMATHGoogle Scholar
  12. 12.
    Chaudhuri, S., Das, G., Narasayya, V.: A robust, optimization-based approach for approximate answering of aggregate queries. ACM SIGMOD Rec. 30, 295–306 (2001)CrossRefGoogle Scholar
  13. 13.
    Chaudhuri, S., Ding, B., Kandula, S.: Approximate query processing: No silver bullet. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 511–519. ACM (2017)Google Scholar
  14. 14.
    Chow, C., Liu, C.: Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory 14(3), 462–467 (1968)MathSciNetCrossRefGoogle Scholar
  15. 15.
    Cormode, G., Garofalakis, M., Haas, P.J., Jermaine, C., et al.: Synopses for massive data: samples, histograms, wavelets, sketches. Found. Trends® Databases 4(1–3), 1–294 (2011)zbMATHGoogle Scholar
  16. 16.
    Crotty, A., Galakatos, A., Zgraggen, A., Binnig, C., Kraska, T.: The case for interactive data exploration accelerators (ideas). In: Proceedings of the Workshop on Human-In-the-Loop Data Analytics, p. 11. ACM (2016)Google Scholar
  17. 17.
    Dalvi, N., Ré, C., Suciu, D.: Probabilistic databases: diamonds in the dirt. Commun. ACM 52(7), 86–94 (2009)CrossRefGoogle Scholar
  18. 18.
    Deshpande, A., Garofalakis, M.N., Rastogi, R.: Independence is good: dependency-based histogram synopses for high-dimensional data. In: SIGMOD Conference (2001)Google Scholar
  19. 19.
    Ding, B., Huang, S., Chaudhuri, S., Chakrabarti, K., Wang, C.: Sample+seek: approximating aggregates with distribution precision guarantee. In: Proceedings of SIGMOD, pp. 679–694 (2016)Google Scholar
  20. 20.
    Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., Roth, A.: Generalization in adaptive data analysis and holdout reuse. In: Advances in Neural Information Processing Systems, pp. 2350–2358 (2015)Google Scholar
  21. 21.
    Galakatos, A., Crotty, A., Zgraggen, E., Binnig, C., Kraska, T.: Revisiting reuse for approximate query processing. Proc. VLDB Endow. 10(10), 1142–1153 (2017)CrossRefGoogle Scholar
  22. 22.
    Hardt, M., Rothblum, G.N.: A multiplicative weights mechanism for privacy-preserving data analysis. In: 2010 51st Annual IEEE Symposium on, Foundations of Computer Science (FOCS), pp. 61–70. IEEE (2010)Google Scholar
  23. 23.
    Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: ACM SIGMOD Record, vol. 26, pp. 171–182. ACM (1997)Google Scholar
  24. 24.
    Hosangadi, A., Fallah, F., Kastner, R.: Factoring and eliminating common subexpressions in polynomial expressions. In: IEEE/ACM International Conference on Computer Aided Design, 2004. ICCAD-2004 (2004)Google Scholar
  25. 25.
    Jermaine, C., Arumugam, S., Pol, A., Dobra, A.: Scalable approximate query processing with the DBO engine. ACM Trans. Database Syst. (TODS) 33(4), 23 (2008)CrossRefGoogle Scholar
  26. 26.
  27. 27.
    Jetley, P. et al.: Massively parallel cosmological simulations with ChaNGa. In: Proceedings of IPDPS (2008)Google Scholar
  28. 28.
    Jordan, M.: An introduction to probabilistic graphical models (2003). Accessed 10 Nov 2018
  29. 29.
    Kandula, S., Shanbhag, A., Vitorovic, A., Olma, M., Grandl, R., Chaudhuri, S., Ding, B.: Quickr: lazily approximating complex adhoc queries in bigdata clusters. In: Proceedings of the 2016 International Conference on Management of Data, pp. 631–646. ACM (2016)Google Scholar
  30. 30.
    Kipf, A., Kipf, T., Radke, B., Leis, V., Boncz, P., Kemper, A.: Learned cardinalities: estimating correlated joins with deep learning (2018). arXiv preprint arXiv:1809.00677
  31. 31.
    Li, C., et al.: Optimizing linear counting queries under differential privacy. In: Proceedings of PODS, pp. 123–134 (2010)Google Scholar
  32. 32.
    Li, K., Li, G.: Approximate query processing: what is new and where to go? Data Sci. Eng. 3(4), 379–397 (2018)CrossRefGoogle Scholar
  33. 33.
    Li, K., Zhang, Y., Li, G., Tao, W., Yan, Y.: Bounded approximate query processing. IEEE Trans. Knowl. Data Eng. (2018).
  34. 34.
    Mäkinen, E., Siirtola, H.: Reordering the reorderable matrix as an algorithmic problem. In: International Conference on Theory and Application of Diagrams, pp. 453–468. Springer (2000)Google Scholar
  35. 35.
    Markl, V., et al.: Consistently estimating the selectivity of conjuncts of predicates. In: Proceedings of VLDB, pp. 373–384. VLDB Endowment (2005)Google Scholar
  36. 36.
    Mozafari, B., Niu, N.: A handbook for building an approximate query engine. IEEE Data Eng. Bull. 38(3), 3–29 (2015)Google Scholar
  37. 37.
    Murphy, K.: Undirected graphical models (2006). Accessed 19 Nov 2018
  38. 38.
    Orr, L., Balazinska, M., Suciu, D.: Probabilistic database summarization for interactive data exploration. Proc. VLDB Endow. 10(10), 1154–1165 (2017)CrossRefGoogle Scholar
  39. 39.
    Ortiz, J., Balazinska, M., Gehrke, J., Keerthi, S.S.: Learning state representations for query optimization with deep reinforcement learning (2018). arXiv preprint arXiv:1803.08604
  40. 40.
    Park, Y., Mozafari, B., Sorenson, J., Wang, J.: Verdictdb: universalizing approximate query processing. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1461–1476. ACM (2018)Google Scholar
  41. 41.
    Peng, J., Zhang, D., Wang, J., Pei, J.: Aqp++: connecting approximate query processing with aggregate precomputation for interactive analytics. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1477–1492. ACM (2018)Google Scholar
  42. 42.
    Ré, C., Suciu, D.: Understanding cardinality estimation using entropy maximization. ACM TODS 37(1), 6 (2012)CrossRefGoogle Scholar
  43. 43.
    Suciu, D., Olteanu, D., Ré, C., Koch, C.: Probabilistic databases. Synth. Lect. Data Manag. 3(2), 1–180 (2011)CrossRefGoogle Scholar
  44. 44.
    Teh, Y.W., Welling, M.: On improving the efficiency of the iterative proportional fitting procedure. In: AIStats (2003)Google Scholar
  45. 45.
    Thirumuruganathan, S., Hasan, S., Koudas, N., Das, G.: Approximate query processing using deep generative models (2019). arXiv preprint arXiv:1903.10000
  46. 46.
    Tzoumas, K., Deshpande, A., Jensen, C.S.: Efficiently adapting graphical models for selectivity estimation. VLDB J. 22(1), 3–27 (2013)CrossRefGoogle Scholar
  47. 47.
    Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1(1–2), 1–305 (2008)CrossRefGoogle Scholar
  48. 48.
    Wu, M., Jermaine, C.: A Bayesian method for guessing the extreme values in a data set? In: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 471–482. VLDB Endowment (2007)Google Scholar
  49. 49.
    Yang, E., Ravikumar, P., Allen, G.I., Liu, Z.: Graphical models via univariate exponential family distributions. J. Mach. Learn. Res. 16(1), 3813–3847 (2015)MathSciNetzbMATHGoogle Scholar

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. 1.University of WashingtonSeattleUSA

Personalised recommendations