Selectivity estimation with density-model-based multidimensional histogram

Abstract

Histograms are widely used for selectivity estimation on one-dimensional data. Estimating the selectivity of multidimensional queries from one-dimensional histograms yields a high estimation error unless the attribute independence assumption holds. Constructing a multidimensional histogram is also challenging, because its storage grows exponentially with the number of dimensions. In this paper, we propose a density-model-based multidimensional histogram. Instead of storing a large number of buckets, it uses a lightweight density model to predict the densities of a large number of regions. The experimental results indicate that our method provides highly accurate selectivity estimates while occupying little space. Its advantage over existing methods is more pronounced on high-dimensional data.
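
The two obstacles named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method: the synthetic data, the bucket counts, and the helper names (`hist_1d`, `range_sel_1d`, `grid_buckets`) are assumptions made for illustration only. It compares the true selectivity of a two-attribute range query on perfectly correlated data with the estimate obtained by multiplying one-dimensional histogram selectivities, and it counts the buckets a full equi-width grid would need as the number of dimensions grows.

```python
import random


def hist_1d(values, n_buckets, lo=0.0, hi=1.0):
    """Equi-width 1-D histogram storing the fraction of rows per bucket."""
    counts = [0] * n_buckets
    width = (hi - lo) / n_buckets
    for v in values:
        idx = min(int((v - lo) / width), n_buckets - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def range_sel_1d(hist, a, b, lo=0.0, hi=1.0):
    """Selectivity of a <= x < b, assuming uniformity inside each bucket."""
    width = (hi - lo) / len(hist)
    sel = 0.0
    for i, frac in enumerate(hist):
        b_lo = lo + i * width
        b_hi = b_lo + width
        overlap = max(0.0, min(b, b_hi) - max(a, b_lo))
        sel += frac * overlap / width
    return sel


def grid_buckets(per_dim, dims):
    """Number of buckets in a full grid with per_dim buckets per dimension."""
    return per_dim ** dims


# Hypothetical data: two perfectly correlated attributes (y == x).
random.seed(0)
rows = [(x, x) for x in (random.random() for _ in range(10_000))]

# Query: 0 <= x < 0.1 AND 0 <= y < 0.1
a, b = 0.0, 0.1
true_sel = sum(1 for x, y in rows if a <= x < b and a <= y < b) / len(rows)

# Independence assumption: multiply the per-attribute selectivities.
hx = hist_1d([r[0] for r in rows], 10)
hy = hist_1d([r[1] for r in rows], 10)
indep_sel = range_sel_1d(hx, a, b) * range_sel_1d(hy, a, b)

print("true selectivity:        ", true_sel)
print("independence estimate:   ", indep_sel)
for d in (2, 4, 8):
    print(f"grid buckets, {d} dims:    ", grid_buckets(10, d))
```

Because the two attributes are identical, the independence estimate is roughly the square of the true selectivity, which is a severe underestimate. With ten buckets per dimension, the full grid grows from 100 buckets in two dimensions to 10^8 in eight.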

Notes

  1. http://archive.ics.uci.edu/ml/datasets/3D+Road+Network+(North+Jutland,+Denmark).

  2. http://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption.

Acknowledgements

This paper was supported by NSFC Grant U1866602 and CCF-Huawei Database System Innovation Research Plan CCF-HuaweiDBIR2020007B.

Author information

Corresponding author

Correspondence to Hongzhi Wang.

About this article

Cite this article

Zhang, M., Wang, H. Selectivity estimation with density-model-based multidimensional histogram. Knowl Inf Syst (2021). https://doi.org/10.1007/s10115-021-01547-7

Keywords

  • Selectivity estimation
  • Multidimensional histogram
  • Query processing