Skip to main content

Big Data Query Engines

  • Chapter
  • First Online:
Book cover Handbook of Big Data Technologies
  • 7284 Accesses

Abstract

Big data analytics are techniques that are used to analyze large datasets in order to extract patterns, trends, correlations and summaries. Analytics are used in several big data applications ranging from the generation of simple reports to running deep and complex query workloads. The insights drawn by running big data analytics depend primarily on the capabilities of the underlying query engine, which is responsible for translating user queries into efficient data retrieval and processing operations, as well as executing these operations on one or multiple nodes in order to find query answers. Classically, parallel database systems have been adopted in various domains, particularly enterprise data warehouses, as the data processing platform for running big data analytics. An SQL-based query engine, running on a shared-nothing cluster, is typically used by these platforms. Scalability is realized by partitioning data across multiple machines that communicate via a high speed interconnect layer. These systems often rely on dedicated expensive hardware resources in order to scale-out query processing and provide fault tolerance. With the emergence of Hadoop, it became possible to use cheap commodity hardware for achieving linear scalability and fault tolerance. A typical Hadoop environment involves a software stack running in one ecosystem, while sharing hardware resources across different systems, called tenants. Earlier Hadoop query engines leveraged programming frameworks such as MapReduce to run analytics using programs executed on a distributed file system. The Hadoop Distributed File System (HDFS) has been effectively used for batch processing of simple analytics. The need for coding and manual optimization of analytics, the lack of support to complex queries and the limited interactive processing capabilities, have triggered the need for adopting new technologies with more expressive query languages and advanced query processing techniques. Integrating parallel database systems into Hadoop ecosystem is an obvious approach to combine the advantages of both worlds. In this respect, multiple challenges needed to be addressed to fit a parallel database query engine in Hadoop software stack. Data placement, query optimization, query execution and resource management are some of the technical problems that are actively studied in this area. In this chapter, we discuss the state-of-the-art of query engines in parallel database systems, Hadoop-based systems, as well as the hybrid systems that integrate parallel databases and Hadoop technologies. We present the architectures of multiple example systems and highlight their similarity and differences. We also give an overview of the research problems and proposed techniques in the areas of query optimization and execution.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 349.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 449.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 449.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. L. Antova, A., El-Helw, M.A., Soliman, Z., Gu, M. Petropoulos, Waas, F. Optimizing queries over partitioned tables in MPP systems, in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (2014)

    Google Scholar 

  2. M. Armbrust, R.S. Xin, C. Lian, Y. Huai, D. Liu, J.K. Bradley, X. Meng, T. Kaftan, M.J. Franklin, A. Ghodsi, M. Zaharia, Spark SQL: relational data processing in spark, in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (2015)

    Google Scholar 

  3. L. Chan, Presto: Interacting with petabytes of data at Facebook (2016). http://prestodb.io

  4. L. Chang, Z. Wang, T. Ma, L. Jian, L. Ma, A. Goldshuv, L. Lonergan, J. Cohen, C. Welton, G. Sherry, M. Bhandarkar, Hawq: a massively parallel processing SQL engine in hadoop, in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (2014)

    Google Scholar 

  5. J., Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, in OSDI (2004), pp. 10–10

    Google Scholar 

  6. A. El-Helw, V. Raghavan, M.A. Soliman, G. Caragea, Z. Gu, M. Petropoulos, Optimization of common table expressions in MPP database systems, in Proceedings of the VLDB endowment (2015)

    Google Scholar 

  7. G. Graefe, Encapsulation of parallelism in the volcano query processing system, in SIGMOD (1990)

    Google Scholar 

  8. G. Graefe, Query evaluation techniques for large databases. ACM Comput. Surv. 25(2), 73–169 (1993)

    Article  Google Scholar 

  9. G. Graefe, The cascades framework for query optimization. IEEE Data Eng. Bull. 18(3), 19–29 (1995)

    Google Scholar 

  10. Z. Gu, M.A. Soliman, F.M. Waas, Testing the accuracy of query optimizers, in DBTest (2012)

    Google Scholar 

  11. HBase: Apache HBase (2016). https://hbase.apache.org

  12. Huai, Y., Chauhan, A., Gates, A., Hagleitner, G., Hanson, E.N., O?Malley, O., Pandey, J., Yuan, Y., Lee, R., Zhang, X.: Major technical advancements in apache hive, in SIGMOD (2014)

    Google Scholar 

  13. JSON: JSON (2016). http://www.json.org/

  14. M. Kornacker, J. Erickson, Cloudera Impala: Real-Time Queries in Apache Hadoop, for Real (2012). http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html

  15. K. Krikellas, S. Viglas, M. Cintra, in ICDE (2010)

    Google Scholar 

  16. A. Lamb, M. Fuller, R. Varadarajan, N. Tran, B. Vandiver, L. Doshi, C. Bear, The vertica analytic database C-store 7 years later. VLDB Endow 5(12), 1790–1801 (2012)

    Article  Google Scholar 

  17. C. Lattner, V. Adve, Llvm: a compilation framework for lifelong program analysis and transformation, in Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization (2004)

    Google Scholar 

  18. S. Melnik, A. Gubarev, J.J. Long, G. Romer, S. Shivakumar, M. Tolton, T. Vassilakis, Dremel: interactive analysis of web-scale datasets. PVLDB 3(1), 330–339 (2010)

    Google Scholar 

  19. Neumann, T.: Efficiently compiling efficient query plans for modern hardware, in Proceedings of the VLDB Endow

    Google Scholar 

  20. Orca Open Source (2016). https://github.com/greenplum-db/gporca

  21. A. Pavlo, E. Paulson, A. Rasin, D.J. Abadi, D.J. DeWitt, S. Madden, M. Stonebraker, A comparison of approaches to large-scale data analysis, in SIGMOD 2009 (2009)

    Google Scholar 

  22. Pivotal: Apache HAWQ (2016). https://blog.pivotal.io/big-data-pivotal/products/introducing-the-newly-redesigned-apache-hawq

  23. Pivotal: Greenplum Database (2016). http://greenplum.org/

  24. Pivotal: HAWQ (2016). http://hawq.incubator.apache.org/

  25. PostgreSQL: PostgreSQL (2016). http://www.postgresql.org/

  26. Qubole: Presto as a service (2016). https://www.qubole.com/

  27. M.A. Soliman, L. Antova, V. Raghavan, A. El-Helw, Z. Gu, E. Shen, G.C. Caragea, C. Garcia-Alvarado, F. Rahman, M. Petropoulos, F. Waas, S., Narayanan, K. Krikellas, R. Baldwin, Orca: a modular query optimizer architecture for big data, in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (2014)

    Google Scholar 

  28. M. Stonebraker, D.J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E.J., O’Neil, P.E., O’Neil, A. Rasin, N. Tran, S.B. Zdonik, C-Store: a column-oriented DBMS, in VLDB (2005)

    Google Scholar 

  29. Teradata (2013). http://www.teradata.com/

  30. A. Thusoo, J.S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu, R. Murthy, Hive - a petabyte scale data warehouse using hadoop, in ICDE (2010)

    Google Scholar 

  31. R.S. Xin, J. Rosen, M. Zaharia, M.J. Franklin, S. Shenker, I. Stoica, Shark: SQL and rich analytics at scale, in SIGMOD (2013)

    Google Scholar 

  32. Yarn: Yarn (2016). http://hortonworks.com/hadoop/yarn/

  33. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica, Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing, in NSDI 2012 (2012)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Mohamed A. Soliman .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Soliman, M.A. (2017). Big Data Query Engines. In: Zomaya, A., Sakr, S. (eds) Handbook of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-49340-4_6

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-49340-4_6

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49339-8

  • Online ISBN: 978-3-319-49340-4

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics