Caching for SQL-on-Hadoop

Pang, Gene; Li, Haoyuan

doi:10.1007/978-3-319-77525-8_249



Definitions

Caching for SQL-on-Hadoop refers to techniques and systems that store data in order to provide faster access to it for Structured Query Language (SQL) engines running on the Apache Hadoop ecosystem.
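
As a rough illustration of the general idea of keeping data close to the reader, the following is a minimal sketch of a read-through cache with least-recently-used eviction. It is not taken from any SQL-on-Hadoop system: the `ReadThroughCache` class, the `slow_store_read` function, and the file paths are all hypothetical names chosen for this example, and the "slow store" is simulated in memory.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Tiny LRU read-through cache keyed by path (illustrative only)."""

    def __init__(self, load: Callable[[str], bytes], capacity: int = 2) -> None:
        self._load = load            # slow path, e.g. a read from remote storage
        self._capacity = capacity    # maximum number of cached entries
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        if path in self._entries:
            self._entries.move_to_end(path)    # mark as most recently used
            self.hits += 1
            return self._entries[path]
        self.misses += 1
        data = self._load(path)                # fall back to the slow store
        self._entries[path] = data
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used
        return data


def slow_store_read(path: str) -> bytes:
    """Hypothetical stand-in for a slow read from the underlying storage."""
    return f"contents of {path}".encode()


cache = ReadThroughCache(slow_store_read, capacity=2)
cache.read("/warehouse/t1/part-0")   # miss: loaded from the slow store
cache.read("/warehouse/t1/part-0")   # hit: served from the cache
```

Real systems differ in what they cache (files, blocks, or columnar data), where the cache lives, and how entries are evicted; the sketch only shows the hit and miss paths.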

Overview

The Apache Hadoop software project (Apache Hadoop 2018) has grown in popularity for distributed computing and big data. The Hadoop stack is widely used to store large amounts of data and to process that data in a large-scale, distributed, and fault-tolerant manner. The Hadoop ecosystem has helped organizations extract actionable insight from large volumes of collected data, a task that is difficult or infeasible with traditional data processing methods.

The main storage system for Hadoop is the Hadoop Distributed File System (HDFS), a distributed storage system that provides fault-tolerant and scalable storage. The main data processing framework for Hadoop is MapReduce, which is based on the Google MapReduce project (Dean and Ghemawat 2008). MapReduce...


References

Alluxio (2018) Alluxio – open source memory speed virtual distributed storage. https://www.alluxio.org/. Accessed 19 Mar 2018
Apache Drill (2018) Apache Drill. https://drill.apache.org. Accessed 19 Mar 2018
Apache Hadoop (2018) Welcome to Apache Hadoop! http://hadoop.apache.org. Accessed 19 Mar 2018
Apache Hadoop HDFS (2018) Centralized cache management in HDFS. https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html. Accessed 19 Mar 2018
Apache Hive (2018) Apache Hive. https://hive.apache.org. Accessed 19 Mar 2018
Apache Hive LLAP (2018) LLAP. https://cwiki.apache.org/confluence/display/Hive/LLAP. Accessed 19 Mar 2018
Apache Ignite (2018) Apache Ignite. https://ignite.apache.org/index.html. Accessed 19 Mar 2018
Apache Impala (2018) Apache Impala. https://impala.apache.org. Accessed 19 Mar 2018
Apache Spark SQL (2018) Spark SQL. https://spark.apache.org/sql/. Accessed 19 Mar 2018
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Facebook (2018) Presto. https://prestodb.io/. Accessed 19 Mar 2018
Floratou A et al (2016) Adaptive caching in big SQL using the HDFS cache. In: SoCC’16 proceedings of the seventh ACM symposium on cloud computing, Santa Clara, 5–7 Oct 2016
Spark RDD (2018) RDD programming guide. http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence. Accessed 19 Mar 2018
Spark SQL (2018) Spark SQL, dataframes and datasets guide. http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory. Accessed 19 Mar 2018


Author information

Authors and Affiliations

Alluxio Inc., San Mateo, CA, USA
Gene Pang & Haoyuan Li


Corresponding author

Correspondence to Gene Pang.

Editor information

Editors and Affiliations

Institute of Computer Science, University of Tartu, Tartu, Estonia
Sherif Sakr
School of Information Technologies, Sydney University, Sydney, Australia
Albert Y. Zomaya

Section Editor information

IBM Almaden Research Center, San Jose, CA, USA
Yuanyuan Tian
IBM Research – Almaden, San Jose, CA, USA
Fatma Özcan

Cite this entry

Pang, G., Li, H. (2019). Caching for SQL-on-Hadoop. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_249


DOI: https://doi.org/10.1007/978-3-319-77525-8_249
Published: 20 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77524-1
Online ISBN: 978-3-319-77525-8
eBook Packages: Computer Science; Reference Module Computer Science and Engineering
