Columnar Storage Formats
- Row Storage:
A data layout that contiguously stores the values of all columns belonging to a single row.
- Columnar Storage:
A data layout that contiguously stores values belonging to the same column for multiple rows.
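The two definitions above can be illustrated with a minimal sketch. It assumes a tiny hypothetical table with columns (id, city, temp) and models "contiguous storage" as positions in a flat Python list. The table contents and the helper names `row_layout` and `columnar_layout` are illustrative assumptions, not part of any real storage format.

```python
# Hypothetical three-row table with columns (id, city, temp).
rows = [
    (1, "Oslo", 3.5),
    (2, "Lima", 19.0),
    (3, "Pune", 27.25),
]
NUM_COLUMNS = 3


def row_layout(records):
    """Row storage: all values of one row sit next to each other."""
    flat = []
    for record in records:
        flat.extend(record)
    return flat


def columnar_layout(records):
    """Columnar storage: all values of one column sit next to each other."""
    flat = []
    for column in zip(*records):  # transpose rows into columns
        flat.extend(column)
    return flat


row_flat = row_layout(rows)
col_flat = columnar_layout(rows)

# Reading only the "temp" column (index 2):
# the row layout needs a strided scan that skips over the other columns,
# while the columnar layout reads one contiguous slice.
temps_from_rows = row_flat[2::NUM_COLUMNS]
temps_from_columns = col_flat[2 * len(rows):3 * len(rows)]

print(row_flat)
print(col_flat)
print(temps_from_rows == temps_from_columns)
```

In this sketch, an analytical query that touches a single column reads one contiguous range in the columnar layout, whereas the row layout forces it to step over every unrelated value.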
Fast analytics over Hadoop data has gained significant traction over the last few years, as many enterprises use Hadoop to store data from sources such as operational systems, sensors, mobile devices, and web applications. Various Big Data frameworks have been developed to support fast analytics on top of this data and to provide insights in near real time.
A crucial aspect in delivering high performance in such large-scale environments is the underlying data layout. Most Big Data frameworks are designed to operate on top of data stored in various formats, and they are extensible enough to incorporate new data formats. Over the years, a plethora of open-source data formats have been designed to support the...