Encyclopedia of Big Data Technologies

Living Edition
Editors: Sherif Sakr, Albert Zomaya

Query Optimization Challenges for SQL-on-Hadoop

  • Mohamed A. Soliman
Living reference work entry
DOI: https://doi.org/10.1007/978-3-319-63962-8_323-1


In database management systems, the query optimizer is the component responsible for mapping an input query to the most efficient mechanism for executing it, called the query execution plan. Query execution is a resource-intensive operation that consumes the memory, I/O, and network bandwidth of the underlying database management system. The query optimizer builds a space of plan alternatives that captures the different ways of executing an input query, such as different orderings of the joins among the tables referenced by the query. Each plan alternative is assessed using a cost model, which computes a cost estimate that predicts the plan’s wall-clock running time. The optimizer then picks the plan with the lowest estimated cost.
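The enumerate-then-cost loop described above can be illustrated with a minimal sketch. It is not taken from any particular optimizer. The table names, row counts, and selectivities are hypothetical. The cost model (the sum of estimated intermediate result sizes) is a deliberately simple stand-in for a real wall-clock estimate, and only left-deep join orders are considered.

```python
from itertools import permutations

# Hypothetical table cardinalities (row counts); illustrative values only.
CARDINALITY = {"orders": 1_000_000, "customers": 50_000, "nations": 25}

# Hypothetical selectivities for each pair of tables linked by a join predicate.
SELECTIVITY = {
    frozenset({"orders", "customers"}): 1 / 50_000,
    frozenset({"customers", "nations"}): 1 / 25,
}


def join_selectivity(joined, new_table):
    """Combined selectivity of all predicates linking new_table to joined tables.

    Returns 1.0 (a cross product) when no predicate applies.
    """
    sel = 1.0
    for table in joined:
        sel *= SELECTIVITY.get(frozenset({table, new_table}), 1.0)
    return sel


def plan_cost(order):
    """Toy cost model: sum of estimated intermediate result sizes
    produced by a left-deep plan that joins tables in the given order."""
    joined = [order[0]]
    rows = float(CARDINALITY[order[0]])
    cost = 0.0
    for table in order[1:]:
        rows = rows * CARDINALITY[table] * join_selectivity(joined, table)
        cost += rows
        joined.append(table)
    return cost


def best_plan(tables):
    """Exhaustively enumerate left-deep join orders and return the cheapest."""
    return min(permutations(tables), key=plan_cost)


if __name__ == "__main__":
    for order in permutations(CARDINALITY):
        print(" JOIN ".join(order), "-> estimated cost", plan_cost(order))
    print("chosen plan:", " JOIN ".join(best_plan(list(CARDINALITY))))
```

Under these assumed statistics, orders that join the two small tables first are estimated to be cheaper than orders that start from the large table, and an order that joins two tables with no connecting predicate is penalized for the cross product. Exhaustive enumeration grows factorially with the number of tables, so this sketch only shows the principle of comparing alternatives by estimated cost.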


The job of a query optimizer is to turn a user query into an efficient query execution plan. The optimizer typically generates the execution plan by considering a large space of possible alternative plans and...




Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Datometry, Inc., San Francisco, USA

Section editors and affiliations

  • Yuanyuan Tian (1)
  • Fatma Özcan (2)
  1. IBM Almaden Research Center, San Jose, United States
  2. IBM Research – Almaden, San Jose, USA