Partitioning and Bucketing in Hive-Based Big Data Warehouses

  • Eduarda Costa
  • Carlos Costa
  • Maribel Yasmina Santos
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 746)

Abstract

Hive is a tool for implementing Data Warehouses in Big Data contexts, organizing data into tables, partitions and buckets. Some studies have examined ways of optimizing the performance of data storage and processing techniques and technologies for Big Data Warehouses. However, few of these studies explore whether the way data is structured influences how Hive responds to queries. This work therefore investigates the impact of creating partitions and buckets on the processing times of Hive-based Big Data Warehouses. The results obtained by applying different modelling and organization strategies in Hive reinforce the advantages of implementing Big Data Warehouses based on denormalized models, as well as the potential benefit of adequate partitioning, which, once aligned with the filters frequently applied to the data, can significantly decrease processing times. In contrast, the use of bucketing techniques showed no evidence of significant advantages.
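
As a rough illustration of why partitioning aligned with frequent filters can reduce processing work, the following Python sketch models a table as partitions that are further split into buckets, and counts how many rows a query has to read. It is a toy model under stated assumptions, not Hive's implementation and not the setup used in this work: the names `build_table`, `scan`, `bucket_of`, the `sales` data and the bucket count are all hypothetical.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for illustration


def bucket_of(value, num_buckets=NUM_BUCKETS):
    """Map a value to a bucket; a simple deterministic stand-in for a hash."""
    return sum(str(value).encode()) % num_buckets


def build_table(rows, partition_col, bucket_col):
    """Group rows into partitions, then into buckets inside each partition."""
    table = defaultdict(lambda: defaultdict(list))
    for row in rows:
        table[row[partition_col]][bucket_of(row[bucket_col])].append(row)
    return table


def scan(table, partition_value=None, predicate=lambda row: True):
    """Return (matching rows, number of rows read).

    When partition_value is given, only that partition is read and all
    other partitions are skipped entirely (partition pruning).
    """
    if partition_value is None:
        partitions = list(table.values())
    elif partition_value in table:
        partitions = [table[partition_value]]
    else:
        partitions = []
    matches, rows_read = [], 0
    for buckets in partitions:
        for bucket_rows in buckets.values():
            for row in bucket_rows:
                rows_read += 1
                if predicate(row):
                    matches.append(row)
    return matches, rows_read


# Hypothetical data: 30 rows spread evenly over the years 2015 to 2017.
rows = [
    {"year": 2015 + i % 3, "customer": f"c{i % 5}", "amount": i}
    for i in range(30)
]
sales = build_table(rows, partition_col="year", bucket_col="customer")

# Filter expressed on the partition column: only one partition is read.
pruned, pruned_read = scan(sales, partition_value=2016)
# Same filter evaluated row by row: every partition is read.
full, full_read = scan(sales, predicate=lambda r: r["year"] == 2016)

print(pruned_read, full_read)  # prints "10 30"
```

Both queries return the same ten rows, but the pruned scan reads a third of the data. In this toy model bucketing does not reduce the rows read for either query, since the filter is not on the bucketing column.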

Keywords

Big Data, Big Data Warehouse, Hive, Partitions, Buckets

Notes

Acknowledgments

This work is supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT – Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2013, and by European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project nº 002814; Funding Reference: POCI-01-0247-FEDER-002814].

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. ALGORITMI Research Centre, University of Minho, Guimarães, Portugal
  2. Center for Computer Graphics, University of Minho, Guimarães, Portugal
