Optimizing Performance of Aggregate Query Processing with Histogram Data Structure

Yong, Liang; Zhaonan, Mu

doi:10.1007/978-3-030-19807-7_33

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 984)

Included in the following conference series:

Computer Science On-line Conference

Abstract

In today’s big data era, the ability to analyze massive data efficiently and return results within a short time limit is critical to decision making. Many big data systems have therefore been proposed, and various distributed and parallel processing techniques have been heavily investigated. Most previous research focuses on precise query processing, while approximate query processing (AQP) techniques, which make interactive data exploration more efficient and allow users to trade off query accuracy against response time, have not been investigated comprehensively. In this paper, we study the characteristics of aggregate queries, a typical type of analytical query, and propose an approximate query processing approach that uses a histogram data structure to optimize the execution of aggregate queries over massive data. We implemented this approach in the big data system Hive and compared it with Hive and the AQP-enabled big data system BlinkDB. The experimental results verify that our approach is significantly faster than these existing systems in most scenarios.
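
The abstract does not describe the histogram structure in detail, so the following is only a minimal sketch of the general idea of answering an aggregate query from a histogram instead of scanning the data. It assumes an equi-width histogram that stores a row count and a value sum per bucket, and it assumes values are spread uniformly inside a partially covered bucket. The names `Histogram`, `build_histogram`, and `approx_count_sum` are hypothetical and are not taken from the paper, Hive, or BlinkDB.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Histogram:
    """Equi-width histogram keeping a row count and a value sum per bucket."""
    low: float          # lower bound of the first bucket
    width: float        # width of every bucket
    counts: List[int]   # number of rows that fell into each bucket
    sums: List[float]   # sum of the values that fell into each bucket


def build_histogram(values: Sequence[float], low: float, high: float,
                    buckets: int) -> Histogram:
    """Scan the column once and summarize it into `buckets` buckets."""
    width = (high - low) / buckets
    counts = [0] * buckets
    sums = [0.0] * buckets
    for v in values:
        if not (low <= v <= high):
            continue  # out-of-range values are ignored in this sketch
        i = min(int((v - low) / width), buckets - 1)
        counts[i] += 1
        sums[i] += v
    return Histogram(low, width, counts, sums)


def approx_count_sum(h: Histogram, lo: float, hi: float) -> Tuple[float, float]:
    """Estimate COUNT(*) and SUM(x) for the predicate lo <= x < hi."""
    est_count = 0.0
    est_sum = 0.0
    for i, (c, s) in enumerate(zip(h.counts, h.sums)):
        b_lo = h.low + i * h.width
        b_hi = b_lo + h.width
        q_lo, q_hi = max(lo, b_lo), min(hi, b_hi)
        overlap = q_hi - q_lo
        if overlap <= 0:
            continue                      # bucket lies outside the predicate
        if overlap >= h.width:
            est_count += c                # fully covered: stored totals are exact
            est_sum += s
        else:
            frac = overlap / h.width      # partially covered: uniformity assumption
            est_count += c * frac
            est_sum += c * frac * (q_lo + q_hi) / 2.0
    return est_count, est_sum
```

An estimate of `AVG(x)` follows as the estimated sum divided by the estimated count. The answer is exact when the predicate boundaries align with bucket boundaries and approximate otherwise, which is the accuracy versus response time trade-off the abstract refers to: more buckets cost more space and build time but reduce the error in partially covered buckets.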

Acknowledgements

This work was supported by the Guizhou University Science and Technology Talent Support Program (No. KY [2016] 086).

Author information

Authors and Affiliations

Network and Information Center, Guizhou University of Commerce, Guiyang, 550014, China
Liang Yong & Mu Zhaonan

Corresponding author

Correspondence to Liang Yong.

Editor information

Editors and Affiliations

Faculty of Applied Informatics, Tomas Bata University in Zlín, Zlín, Czech Republic
Radek Silhavy

About this paper

Cite this paper

Yong, L., Zhaonan, M. (2019). Optimizing Performance of Aggregate Query Processing with Histogram Data Structure. In: Silhavy, R. (eds) Software Engineering Methods in Intelligent Algorithms. CSOC 2019. Advances in Intelligent Systems and Computing, vol 984. Springer, Cham. https://doi.org/10.1007/978-3-030-19807-7_33

DOI: https://doi.org/10.1007/978-3-030-19807-7_33
Published: 08 May 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-19806-0
Online ISBN: 978-3-030-19807-7
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)