Definitions
- Row Storage: A data layout that contiguously stores the values belonging to the columns that make up the entire row.
- Columnar Storage: A data layout that contiguously stores values belonging to the same column for multiple rows.
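The two definitions can be illustrated with a minimal sketch. It assumes a hypothetical three-row table with columns (id, name, score) and models storage as a flat Python list; the table contents and all function names are invented for illustration and do not correspond to any particular file format.

```python
# Hypothetical table: each tuple is one row with columns (id, name, score).
ROWS = [
    (1, "alice", 9.5),
    (2, "bob", 7.0),
    (3, "carol", 8.25),
]


def to_row_layout(rows):
    """Row storage: the values of one row are stored next to each other."""
    flat = []
    for row in rows:
        flat.extend(row)
    return flat


def to_columnar_layout(rows):
    """Columnar storage: the values of one column are stored next to each other."""
    flat = []
    for column in zip(*rows):  # zip(*rows) regroups the values by column
        flat.extend(column)
    return flat


def read_column_from_rows(flat, num_columns, col):
    """In the row layout, one column is scattered: take every num_columns-th value."""
    return flat[col::num_columns]


def read_column_from_columns(flat, num_rows, col):
    """In the columnar layout, one column is a single contiguous slice."""
    return flat[col * num_rows:(col + 1) * num_rows]


row_layout = to_row_layout(ROWS)
col_layout = to_columnar_layout(ROWS)
print(row_layout)
print(col_layout)
```

In this sketch, reading the score column from the row layout has to step over the id and name values of every row, whereas the columnar layout returns it as one contiguous range.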
Overview
Fast analytics over Hadoop data has gained significant traction over the last few years, as many enterprises now use Hadoop to store data from a variety of sources, including operational systems, sensors and mobile devices, and web applications. Various Big Data frameworks have been developed to support fast analytics on top of this data and to provide insights in near real time.
A crucial aspect in delivering high performance in such large-scale environments is the underlying data layout. Most Big Data frameworks are designed to operate on top of data stored in various formats, and they are extensible enough to incorporate new data formats. Over the years, a plethora of open-source data formats have been designed to support the...
Copyright information
© 2019 Springer Nature Switzerland AG
About this entry
Cite this entry
Floratou, A. (2019). Columnar Storage Formats. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_248
DOI: https://doi.org/10.1007/978-3-319-77525-8_248
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77524-1
Online ISBN: 978-3-319-77525-8
eBook Packages: Computer Science, Reference Module Computer Science and Engineering