Encyclopedia of Database Systems

2018 Edition
Editors: Ling Liu, M. Tamer Özsu

SQL Analytics on Big Data

  • Fatma Özcan
  • Ippokratis Pandis
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_80648

Synonyms

SQL-on-Hadoop

Definition

Over the last decade, the database field has witnessed major innovations and changes in enterprise data platforms. First came the wave of NoSQL systems, which provide high scalability, although sometimes at the expense of ACID transactions and declarative SQL processing. On the analytics side, Hadoop emerged as the platform for all analytics needs of the enterprise. Although Hadoop started with just the MapReduce processing framework and the Hadoop Distributed File System (HDFS), it evolved into a multi-framework environment supporting MapReduce, Spark, Tez, and others. Such processing environments, where data can be accessed and manipulated by multiple processing frameworks, are frequently referred to as data lakes. Given the popularity of SQL and its widespread use in enterprise analytics tools, it soon became evident that SQL processing on data lakes is critical to this emerging enterprise data platform.

In this entry, we discuss SQL analytics...

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. IBM Research – Almaden, San Jose, USA
  2. Carnegie Mellon University, Pittsburgh, USA
  3. Amazon Web Services, Seattle, USA