Abstract
Big data analytics are techniques used to analyze large datasets in order to extract patterns, trends, correlations and summaries. Analytics are used in many big data applications, ranging from the generation of simple reports to running deep and complex query workloads. The insights drawn from big data analytics depend primarily on the capabilities of the underlying query engine, which is responsible for translating user queries into efficient data retrieval and processing operations, and for executing these operations on one or more nodes to compute query answers. Classically, parallel database systems have been adopted in various domains, particularly enterprise data warehouses, as the data processing platform for running big data analytics. These platforms typically use an SQL-based query engine running on a shared-nothing cluster. Scalability is realized by partitioning data across multiple machines that communicate via a high-speed interconnect layer. These systems often rely on dedicated, expensive hardware resources in order to scale out query processing and provide fault tolerance. With the emergence of Hadoop, it became possible to use cheap commodity hardware to achieve linear scalability and fault tolerance. A typical Hadoop environment involves a software stack running in one ecosystem, with hardware resources shared across different systems, called tenants. Earlier Hadoop query engines leveraged programming frameworks such as MapReduce to run analytics as programs executed over a distributed file system. The Hadoop Distributed File System (HDFS) has been used effectively for batch processing of simple analytics. However, the need for coding and manual optimization of analytics, the lack of support for complex queries and the limited interactive processing capabilities have triggered the adoption of new technologies with more expressive query languages and more advanced query processing techniques.
Integrating parallel database systems into the Hadoop ecosystem is an obvious approach to combining the advantages of both worlds. Multiple challenges need to be addressed to fit a parallel database query engine into the Hadoop software stack. Data placement, query optimization, query execution and resource management are some of the technical problems that are actively studied in this area. In this chapter, we discuss the state of the art in query engines for parallel database systems, Hadoop-based systems, and hybrid systems that integrate parallel database and Hadoop technologies. We present the architectures of several example systems and highlight their similarities and differences. We also give an overview of the research problems and proposed techniques in the areas of query optimization and execution.
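The shared-nothing partitioning mentioned in the abstract can be illustrated with a small sketch. The code below is not taken from the chapter or from any of the systems it surveys; the function names (`partition`, `local_sum`, `distributed_sum`) and the use of CRC32 as the partitioning hash are assumptions made purely for illustration. It simulates hash-partitioning rows across nodes, computing a grouped sum locally on each node, and merging the per-node results.

```python
import zlib
from collections import defaultdict


def partition(rows, num_nodes):
    """Hash-partition (key, value) rows across num_nodes simulated nodes.

    Hypothetical helper: CRC32 is used only because it is stable across
    runs, unlike Python's built-in hash() on strings.
    """
    nodes = [[] for _ in range(num_nodes)]
    for key, value in rows:
        node_id = zlib.crc32(key.encode("utf-8")) % num_nodes
        nodes[node_id].append((key, value))
    return nodes


def local_sum(node_rows):
    """Grouped sum computed by one node over its own partition only."""
    totals = defaultdict(int)
    for key, value in node_rows:
        totals[key] += value
    return dict(totals)


def distributed_sum(rows, num_nodes):
    """Run local_sum on every partition and merge the partial results.

    Because rows are partitioned on the grouping key, each key lives on
    exactly one node, so merging is a plain union of the partial results.
    """
    merged = {}
    for node_rows in partition(rows, num_nodes):
        merged.update(local_sum(node_rows))
    return merged


rows = [("us", 3), ("eu", 5), ("us", 4), ("asia", 1)]
result = distributed_sum(rows, num_nodes=3)
```

In this sketch the partitioning key and the grouping key coincide, so no data has to move between nodes after the initial placement. When they differ, a real engine would need to redistribute rows or merge partial aggregates, which is one reason data placement appears among the problems listed above.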
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Soliman, M.A. (2017). Big Data Query Engines. In: Zomaya, A., Sakr, S. (eds) Handbook of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-49340-4_6
DOI: https://doi.org/10.1007/978-3-319-49340-4_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49339-8
Online ISBN: 978-3-319-49340-4
eBook Packages: Computer Science (R0)