Definitions
Caching for SQL-on-Hadoop refers to techniques and systems that store data to provide faster access to it for Structured Query Language (SQL) engines running on the Apache Hadoop ecosystem.
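To make the definition concrete, the following is a minimal sketch of the general idea: a small in-memory cache placed in front of a slower storage read. It is not taken from any of the systems discussed in this entry. The names `ReadThroughCache` and `slow_read`, the block identifiers, and the least-recently-used eviction policy are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Illustrative LRU read-through cache in front of a slow block reader."""

    def __init__(self, read_block, capacity):
        self._read_block = read_block  # hypothetical slow reader (e.g., remote storage)
        self._capacity = capacity      # maximum number of blocks kept in memory
        self._blocks = OrderedDict()   # block_id -> data, oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, block_id):
        if block_id in self._blocks:
            # Cache hit: mark the block as most recently used and return it.
            self._blocks.move_to_end(block_id)
            self.hits += 1
            return self._blocks[block_id]
        # Cache miss: read from the slow store, then keep a copy in memory.
        self.misses += 1
        data = self._read_block(block_id)
        self._blocks[block_id] = data
        if len(self._blocks) > self._capacity:
            # Evict the least recently used block to stay within capacity.
            self._blocks.popitem(last=False)
        return data


def slow_read(block_id):
    # Stand-in for an expensive storage read; a real engine would fetch file data.
    return f"rows-of-{block_id}"


cache = ReadThroughCache(slow_read, capacity=2)
for block_id in ["a", "b", "a", "c", "b"]:
    cache.get(block_id)
print(cache.hits, cache.misses)
```

Repeated reads of the same block are served from memory, while blocks that have not been used recently are evicted and must be read from storage again. Real SQL-on-Hadoop caches differ in what they cache, where the cache lives, and how they evict.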
Overview
The Apache Hadoop software project (Apache Hadoop 2018) has grown in popularity for distributed computing and big data. The Hadoop stack is widely used for storing large amounts of data and for large-scale, distributed, fault-tolerant processing of that data. The Hadoop ecosystem has helped organizations extract actionable insight from large volumes of collected data, a task that is difficult or infeasible with traditional data processing methods.
The main storage system for Hadoop is the Hadoop Distributed File System (HDFS), which provides fault-tolerant and scalable distributed storage. The main data processing framework for Hadoop is MapReduce, which is based on the Google MapReduce project (Dean and Ghemawat 2008). MapReduce...
References
Alluxio (2018) Alluxio – open source memory speed virtual distributed storage. https://www.alluxio.org/. Accessed 19 Mar 2018
Apache Drill (2018) Apache Drill. https://drill.apache.org. Accessed 19 Mar 2018
Apache Hadoop (2018) Welcome to Apache Hadoop! http://hadoop.apache.org. Accessed 19 Mar 2018
Apache Hadoop HDFS (2018) Centralized cache management in HDFS. https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html. Accessed 19 Mar 2018
Apache Hive (2018) Apache Hive. https://hive.apache.org. Accessed 19 Mar 2018
Apache Hive LLAP (2018) LLAP. https://cwiki.apache.org/confluence/display/Hive/LLAP. Accessed 19 Mar 2018
Apache Ignite (2018) Apache Ignite. https://ignite.apache.org/index.html. Accessed 19 Mar 2018
Apache Impala (2018) Apache Impala. https://impala.apache.org. Accessed 19 Mar 2018
Apache Spark SQL (2018) Spark SQL. https://spark.apache.org/sql/. Accessed 19 Mar 2018
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Facebook (2018) Presto. https://prestodb.io/. Accessed 19 Mar 2018
Floratou A et al (2016) Adaptive caching in big SQL using the HDFS cache. In: SoCC’16 proceedings of the seventh ACM symposium on cloud computing, Santa Clara, 5–7 Oct 2016
Spark RDD (2018) RDD programming guide. http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence. Accessed 19 Mar 2018
Spark SQL (2018) Spark SQL, dataframes and datasets guide. http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory. Accessed 19 Mar 2018
Copyright information
© 2019 Springer International Publishing AG, part of Springer Nature
About this entry
Cite this entry
Pang, G., Li, H. (2019). Caching for SQL-on-Hadoop. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_249
DOI: https://doi.org/10.1007/978-3-319-77525-8_249
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77524-1
Online ISBN: 978-3-319-77525-8
eBook Packages: Computer Science; Reference Module Computer Science and Engineering