SQL Analytics on Big Data

Encyclopedia of Database Systems

Synonyms

SQL-on-Hadoop

Definition

Over the last decade, the database field has witnessed major innovations and changes in enterprise data platforms. First came the wave of NoSQL systems, which provide high scalability, although sometimes at the expense of ACID transactions and declarative SQL processing. On the analytics side, Hadoop emerged as the platform for all analytics needs of the enterprise. Although Hadoop started with just the MapReduce processing framework and the Hadoop Distributed File System (HDFS), it evolved into a multi-framework environment, supporting MapReduce, Spark, Tez, and others. Such processing environments, where data can be accessed and manipulated by multiple processing frameworks, are frequently referred to as data lakes. Given the popularity of SQL and its widespread use in enterprise analytics tools, it soon became evident that SQL processing on data lakes is critical to this emerging enterprise data platform.

In this entry, we discuss SQL analytics...

Corresponding author

Correspondence to Fatma Özcan.

Copyright information

© 2018 Springer Science+Business Media, LLC, part of Springer Nature

Cite this entry

Özcan, F., Pandis, I. (2018). SQL Analytics on Big Data. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_80648
