Partitioning and Bucketing in Hive-Based Big Data Warehouses

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 746))

Abstract

Hive is a tool for implementing Data Warehouses in Big Data contexts, organizing data into tables, partitions and buckets. Several studies have examined how to optimize the performance of data storage and processing technologies for Big Data Warehouses. However, few of them explore whether the way data is structured influences how Hive responds to queries. This work therefore investigates the impact of creating partitions and buckets on the processing times of Hive-based Big Data Warehouses. The results obtained by applying different modelling and organization strategies in Hive reinforce the advantages of implementing Big Data Warehouses based on denormalized models and show the potential benefit of adequate partitioning, which, once aligned with the filters frequently applied to the data, can significantly decrease processing times. In contrast, the use of bucketing techniques showed no evidence of significant advantages.
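The intuition behind partitioning aligned with query filters can be illustrated with a small, self-contained Python sketch. It is not Hive code and not the authors' experimental setup: the rows, the choice of `year` as partition key, the choice of `customer_id` as bucketing column, the bucket count and the modulo stand-in for a hash function are all assumptions made only for illustration. The sketch compares how many rows are read when a filter matches the partition key against a full scan of an unpartitioned layout.

```python
from collections import defaultdict

# Hypothetical sales rows: (year, customer_id, amount). Not data from the paper.
ROWS = [
    (2016, 101, 10.0),
    (2016, 102, 20.0),
    (2017, 101, 5.0),
    (2017, 103, 7.5),
    (2018, 104, 12.5),
]

NUM_BUCKETS = 2  # assumed bucket count, chosen only for illustration


def build_layout(rows, num_buckets=NUM_BUCKETS):
    """Group rows by partition key (year), then by bucket within each partition."""
    layout = defaultdict(lambda: defaultdict(list))
    for year, customer_id, amount in rows:
        # Modulo on the integer key is a simple stand-in for a hash function.
        bucket = customer_id % num_buckets
        layout[year][bucket].append((year, customer_id, amount))
    return layout


def total_for_year(layout, year):
    """Filter on the partition key: only the matching partition is read."""
    rows_read = 0
    total = 0.0
    for bucket_rows in layout.get(year, {}).values():
        for _, _, amount in bucket_rows:
            rows_read += 1
            total += amount
    return total, rows_read


def total_for_year_full_scan(rows, year):
    """Same query on an unpartitioned layout: every row is read and tested."""
    rows_read = 0
    total = 0.0
    for row_year, _, amount in rows:
        rows_read += 1
        if row_year == year:
            total += amount
    return total, rows_read
```

In this toy layout both functions return the same total for a given year, but the partitioned version reads only the rows stored under that year. Bucketing here only spreads the rows of a partition across a fixed number of groups; it does not reduce the rows read for a filter on `year`, which is at least compatible with the abstract's observation that bucketing showed no significant advantage.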

Acknowledgments

This work is supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT – Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2013, and by European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project nº 002814; Funding Reference: POCI-01-0247-FEDER-002814].

Author information

Corresponding author

Correspondence to Eduarda Costa.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Costa, E., Costa, C., Santos, M.Y. (2018). Partitioning and Bucketing in Hive-Based Big Data Warehouses. In: Rocha, Á., Adeli, H., Reis, L., Costanzo, S. (eds) Trends and Advances in Information Systems and Technologies. WorldCIST'18 2018. Advances in Intelligent Systems and Computing, vol 746. Springer, Cham. https://doi.org/10.1007/978-3-319-77712-2_72

  • DOI: https://doi.org/10.1007/978-3-319-77712-2_72

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-77711-5

  • Online ISBN: 978-3-319-77712-2

  • eBook Packages: Engineering, Engineering (R0)
