Columnar Storage Formats

Floratou, Avrilia

doi:10.1007/978-3-319-77525-8_248

Definitions

Row Storage: A data layout that contiguously stores the values belonging to the columns that make up the entire row.
Columnar Storage: A data layout that contiguously stores values belonging to the same column for multiple rows.
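
The two definitions above can be illustrated with a minimal sketch. The table, its column names, and the helper functions below are hypothetical and are not taken from the entry or from any particular file format; the sketch only flattens a small in-memory table into one sequence of values per layout and shows where the values of a single column end up.

```python
# Hypothetical three-row table with columns (id, name, price).
rows = [
    (1, "apple", 0.50),
    (2, "banana", 0.25),
    (3, "cherry", 3.00),
]
N_ROWS = len(rows)
N_COLS = len(rows[0])


def to_row_layout(table):
    """Row storage: the values of each row are stored next to each other."""
    return [value for row in table for value in row]


def to_columnar_layout(table):
    """Columnar storage: the values of each column are stored next to each other."""
    return [value for column in zip(*table) for value in column]


def positions_in_row_layout(col):
    """Offsets of one column's values in the row layout (strided by N_COLS)."""
    return [r * N_COLS + col for r in range(N_ROWS)]


def positions_in_columnar_layout(col):
    """Offsets of one column's values in the columnar layout (one contiguous run)."""
    return list(range(col * N_ROWS, (col + 1) * N_ROWS))


row_layout = to_row_layout(rows)
col_layout = to_columnar_layout(rows)

PRICE = 2  # index of the hypothetical "price" column
price_positions_row = positions_in_row_layout(PRICE)
price_positions_columnar = positions_in_columnar_layout(PRICE)

print(row_layout)
print(col_layout)
print(price_positions_row, price_positions_columnar)
```

In this sketch, reading only the price column touches scattered offsets in the row layout but a single contiguous range in the columnar layout, which is the property that the columnar definition describes.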

Overview

Fast analytics over Hadoop data has gained significant traction over the last few years, as multiple enterprises are using Hadoop to store data coming from various sources including operational systems, sensors and mobile devices, and web applications. Various Big Data frameworks have been developed to support fast analytics on top of this data and to provide insights in near real time.

A crucial aspect in delivering high performance in such large-scale environments is the underlying data layout. Most Big Data frameworks are designed to operate on top of data stored in various formats, and they are extensible enough to incorporate new data formats. Over the years, a plethora of open-source data formats have been designed to support the...


Author information

Authors and Affiliations

Microsoft, Sunnyvale, CA, USA
Avrilia Floratou

Corresponding author

Correspondence to Avrilia Floratou.

Editor information

Editors and Affiliations

Institute of Computer Science, University of Tartu, Tartu, Estonia
Sherif Sakr
School of Information Technologies, Sydney University, Sydney, Australia
Albert Y. Zomaya

About this entry

Cite this entry

Floratou, A. (2019). Columnar Storage Formats. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_248

DOI: https://doi.org/10.1007/978-3-319-77525-8_248
Published: 20 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77524-1
Online ISBN: 978-3-319-77525-8
eBook Packages: Computer Science, Reference Module Computer Science and Engineering
