Partitioning and Bucketing in Hive-Based Big Data Warehouses

Costa, Eduarda; Costa, Carlos; Santos, Maribel Yasmina

doi:10.1007/978-3-319-77712-2_72

Conference paper
First Online: 17 May 2018

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 746)

Abstract

Hive is a tool that allows the implementation of Data Warehouses for Big Data contexts, organizing data into tables, partitions and buckets. Some studies have been conducted to understand ways of optimizing the performance of data storage and processing techniques/technologies for Big Data Warehouses. However, few of these studies explore whether the way data is structured has any influence on how Hive responds to queries. Thus, this work investigates the impact of creating partitions and buckets on the processing times of Hive-based Big Data Warehouses. The results obtained with the application of different modelling and organization strategies in Hive reinforce the advantages associated with the implementation of Big Data Warehouses based on denormalized models and, also, the potential benefit of adequate partitioning that, once aligned with the filters frequently applied to the data, can significantly decrease processing times. In contrast, the use of bucketing techniques shows no evidence of significant advantages.
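To make the idea of partitions and buckets concrete, the following is a minimal Python sketch of the general mechanism, not the authors' benchmark and not Hive's implementation. It assumes a toy in-memory table with hypothetical columns (`year`, `order_id`, `amount`). Rows are grouped by a partition column, so a filter on that column reads only the matching group, and rows are assigned to buckets by hashing a column modulo the number of buckets. The CRC32 hash is a stand-in chosen for determinism; Hive uses its own hash function.

```python
import zlib
from collections import defaultdict


def build_partitions(rows, key):
    """Group rows by the value of `key`, one group per distinct value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


def bucket_of(value, num_buckets):
    """Map a value to a bucket index via hash modulo the bucket count.

    Assumption: CRC32 is only an illustrative, deterministic hash.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def scan_full(rows, key, wanted):
    """Filter without partitioning: every row has to be inspected."""
    matches = [row for row in rows if row[key] == wanted]
    return matches, len(rows)


def scan_pruned(partitions, wanted):
    """Filter on the partition column: only one group is read."""
    matches = list(partitions.get(wanted, []))
    return matches, len(matches)


# Hypothetical toy data: 9 rows spread evenly over 3 years.
rows = [{"year": 2015 + i % 3, "order_id": i, "amount": 10 * i} for i in range(9)]
partitions = build_partitions(rows, "year")

full_result, full_scanned = scan_full(rows, "year", 2016)
pruned_result, pruned_scanned = scan_pruned(partitions, 2016)

# Bucket assignment for each row, here using 4 buckets on order_id.
buckets = {row["order_id"]: bucket_of(row["order_id"], 4) for row in rows}

print("rows read without partitioning:", full_scanned)
print("rows read with partition pruning:", pruned_scanned)
```

Both scans return the same rows, but the pruned scan touches only the partition that matches the filter. The saving depends on the filter being on the partition column; a filter on any other column would still have to read every partition.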

Acknowledgments

This work is supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT – Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2013, and by European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project nº 002814; Funding Reference: POCI-01-0247-FEDER-002814].

Author information

Authors and Affiliations

ALGORITMI Research Centre, University of Minho, 4800 058, Guimarães, Portugal
Eduarda Costa, Carlos Costa & Maribel Yasmina Santos
Center for Computer Graphics, University of Minho, 4800 058, Guimarães, Portugal
Carlos Costa

Corresponding author

Correspondence to Eduarda Costa.

Editor information

Editors and Affiliations

Departamento de Engenharia Informática, Universidade de Coimbra, Coimbra, Portugal
Álvaro Rocha
College of Engineering, The Ohio State University, Columbus, Ohio, USA
Hojjat Adeli
DSI/EEUM, Universidade do Minho, Guimarães, Portugal
Luís Paulo Reis
DIMES, Universita della Calabria, Arcavacata di Rende, Italy
Sandra Costanzo

Cite this paper

Costa, E., Costa, C., Santos, M.Y. (2018). Partitioning and Bucketing in Hive-Based Big Data Warehouses. In: Rocha, Á., Adeli, H., Reis, L., Costanzo, S. (eds) Trends and Advances in Information Systems and Technologies. WorldCIST'18 2018. Advances in Intelligent Systems and Computing, vol 746. Springer, Cham. https://doi.org/10.1007/978-3-319-77712-2_72

DOI: https://doi.org/10.1007/978-3-319-77712-2_72
Published: 17 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77711-5
Online ISBN: 978-3-319-77712-2
eBook Packages: Engineering, Engineering (R0)
