Big Data Query Engines

Soliman, Mohamed A.

doi:10.1007/978-3-319-49340-4_6

Abstract

Big data analytics are techniques for analyzing large datasets in order to extract patterns, trends, correlations and summaries. Analytics are used in many big data applications, ranging from the generation of simple reports to running deep and complex query workloads. The insights drawn from big data analytics depend primarily on the capabilities of the underlying query engine, which is responsible for translating user queries into efficient data retrieval and processing operations, and for executing these operations on one or more nodes to find query answers. Classically, parallel database systems have been adopted in various domains, particularly enterprise data warehouses, as the data processing platform for running big data analytics. These platforms typically use an SQL-based query engine running on a shared-nothing cluster. Scalability is realized by partitioning data across multiple machines that communicate via a high-speed interconnect layer. These systems often rely on dedicated, expensive hardware resources to scale out query processing and provide fault tolerance. With the emergence of Hadoop, it became possible to use cheap commodity hardware to achieve linear scalability and fault tolerance. A typical Hadoop environment involves a software stack running in one ecosystem, with hardware resources shared across different systems, called tenants. Earlier Hadoop query engines leveraged programming frameworks such as MapReduce to run analytics using programs executed on a distributed file system. The Hadoop Distributed File System (HDFS) has been used effectively for batch processing of simple analytics. The need for coding and manual optimization of analytics, the lack of support for complex queries and the limited interactive processing capabilities have triggered the adoption of new technologies with more expressive query languages and advanced query processing techniques.
Integrating parallel database systems into the Hadoop ecosystem is an obvious approach to combining the advantages of both worlds. In this respect, multiple challenges need to be addressed to fit a parallel database query engine into the Hadoop software stack. Data placement, query optimization, query execution and resource management are some of the technical problems that are actively studied in this area. In this chapter, we discuss the state of the art of query engines in parallel database systems, in Hadoop-based systems, and in hybrid systems that integrate parallel database and Hadoop technologies. We present the architectures of multiple example systems and highlight their similarities and differences. We also give an overview of the research problems and proposed techniques in the areas of query optimization and execution.
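
As an illustration of the partitioning idea mentioned in the abstract, the following is a minimal single-process sketch, not taken from the chapter, of how a shared-nothing engine might hash-partition rows across nodes, aggregate locally on each node, and merge the partial results. The function names, the `sales` data, and the use of CRC32 as the partitioning hash are all assumptions made for the example; real engines differ in how they place data and exchange partial results.

```python
import zlib
from collections import defaultdict


def node_for(value, num_nodes):
    """Map a partitioning-key value to a node id (hypothetical helper).

    CRC32 is used only because it is deterministic across runs;
    it is an assumption, not what any particular engine uses.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_nodes


def hash_partition(rows, key, num_nodes):
    """Split rows into one list per node, based on the hash of row[key]."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[node_for(row[key], num_nodes)].append(row)
    return partitions


def local_aggregate(partition, group_key, value_key):
    """Sum value_key per group using only the rows held by one node."""
    totals = defaultdict(int)
    for row in partition:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


def merge_partials(partials):
    """Combine the per-node partial sums into the final answer."""
    result = defaultdict(int)
    for partial in partials:
        for group, total in partial.items():
            result[group] += total
    return dict(result)


def distributed_sum(rows, group_key, value_key, num_nodes):
    """Simulate a partitioned GROUP BY ... SUM(...) in one process."""
    partitions = hash_partition(rows, group_key, num_nodes)
    partials = [local_aggregate(p, group_key, value_key) for p in partitions]
    return merge_partials(partials)


# Made-up example data.
sales = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
    {"region": "north", "amount": 3},
]
totals = distributed_sum(sales, "region", "amount", num_nodes=3)
```

Because the rows are partitioned on the grouping key in this sketch, every group lands on exactly one node and the merge step only unions the partial results. Partitioning on a different column would still give the same totals, but the merge step would then have to add up partial sums for the same group coming from several nodes.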

Author information

Authors and Affiliations

Datometry Inc., San Francisco, CA, USA
Mohamed A. Soliman

Corresponding author

Correspondence to Mohamed A. Soliman.

Editor information

Editors and Affiliations

School of Information Technologies, The University of Sydney, Sydney, New South Wales, Australia
Albert Y. Zomaya
The School of Computer Science, The University of New South Wales, Eveleigh, New South Wales, Australia
Sherif Sakr

About this chapter

Cite this chapter

Soliman, M.A. (2017). Big Data Query Engines. In: Zomaya, A., Sakr, S. (eds) Handbook of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-49340-4_6

DOI: https://doi.org/10.1007/978-3-319-49340-4_6
Published: 26 February 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49339-8
Online ISBN: 978-3-319-49340-4
eBook Packages: Computer Science (R0)
