SQL Analytics on Big Data

Özcan, Fatma; Pandis, Ippokratis

doi:10.1007/978-1-4614-8265-9_80648

Synonyms

SQL-on-Hadoop

Definition

Over the last decade, the database field has witnessed major innovations and changes in enterprise data platforms. First came the wave of NoSQL systems, which provide high scalability, although sometimes at the expense of ACID transactions and declarative SQL processing. On the analytics side, Hadoop emerged as the platform for all analytics needs of the enterprise. Although Hadoop started with just the MapReduce processing framework and the Hadoop Distributed File System (HDFS), it evolved into a multi-framework environment supporting MapReduce, Spark, Tez, and others. Such processing environments, where data can be accessed and manipulated by multiple processing frameworks, are frequently referred to as data lakes. Given the popularity of SQL and its widespread use in enterprise analytics tools, it soon became evident that SQL processing on data lakes is critical to this emerging enterprise data platform.

In this entry, we discuss SQL analytics...

Author information

Authors and Affiliations

IBM Research – Almaden, San Jose, CA, USA
Fatma Özcan
Carnegie Mellon University, Pittsburgh, PA, USA
Ippokratis Pandis
Amazon Web Services, Seattle, WA, USA
Ippokratis Pandis

Corresponding author

Correspondence to Fatma Özcan.

Editor information

Editors and Affiliations

Georgia Institute of Technology College of Computing, Atlanta, GA, USA
Ling Liu
University of Waterloo School of Computer Science, Waterloo, ON, Canada
M. Tamer Özsu

Cite this entry

Özcan, F., Pandis, I. (2018). SQL Analytics on Big Data. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_80648

DOI: https://doi.org/10.1007/978-1-4614-8265-9_80648
Published: 07 December 2018
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering