An Efficient Data Extracting Method Based on Hadoop

  • Lianchao Cao (email author)
  • Zhanqiang Li
  • Kaiyuan Qi
  • Guomao Xin
  • Dong Zhang
Conference paper
Part of the Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering book series (LNICST, volume 142)

Abstract

As an open-source big data solution, the Hadoop ecosystem has been widely accepted and applied. However, importing large amounts of data from a traditional relational database into Hadoop in a short time has become a major challenge for the ETL (Extract-Transform-Load) stage of big data processing. This paper presents an efficient parallel data extraction method based on Hadoop, which uses the MapReduce computation engine to call the JDBC (Java Database Connectivity) interface for data extraction. To address the problem of splitting the input among multiple Map tasks, the paper presents a dynamic segmentation algorithm for Map input based on range partitioning, which effectively avoids data skew and distributes the input data more uniformly across the Map tasks. Experimental results show that, compared with the ETL tool Sqoop, which uses the same MapReduce computation engine, the proposed method divides the input data more uniformly and takes less time to extract the same data.
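The abstract does not spell out the segmentation algorithm, so the following is only a minimal sketch of the general idea of range partitioning, not the authors' method. It compares equal-width ranges computed from the minimum and maximum of a key column (the approach commonly described for Sqoop's default split on a numeric column) with ranges chosen so that each holds about the same number of rows. The function names, the key column, and the skewed sample data are all hypothetical, and keys are assumed to be unique.

```python
from bisect import bisect_left


def equal_width_bounds(lo, hi, n):
    """Cut [lo, hi] into n equal-width ranges; returns n + 1 boundaries."""
    step = (hi - lo) / n
    return [lo + i * step for i in range(n)] + [hi]


def equal_count_bounds(sorted_keys, n):
    """Pick boundaries from the sorted keys so each range holds about len/n rows."""
    total = len(sorted_keys)
    bounds = [sorted_keys[i * total // n] for i in range(n)]
    bounds.append(sorted_keys[-1])
    return bounds


def rows_per_range(sorted_keys, bounds):
    """Count keys in [b0, b1), [b1, b2), ...; the last range also includes its upper bound."""
    counts = []
    last = len(bounds) - 2
    for i in range(len(bounds) - 1):
        start = bisect_left(sorted_keys, bounds[i])
        if i == last:
            end = len(sorted_keys)
        else:
            end = bisect_left(sorted_keys, bounds[i + 1])
        counts.append(end - start)
    return counts


# Hypothetical skewed key column: 900 dense keys followed by 100 sparse ones.
keys = sorted(list(range(1, 901)) + [1000 + 100 * i for i in range(100)])

width_counts = rows_per_range(keys, equal_width_bounds(keys[0], keys[-1], 4))
count_counts = rows_per_range(keys, equal_count_bounds(keys, 4))
print("equal-width ranges:", width_counts)
print("equal-count ranges:", count_counts)
```

On this made-up data the equal-width cut places more than 900 of the 1000 rows in the first range, while the equal-count cut gives each of the four ranges 250 rows. In an extraction job, each range would presumably become the key predicate of one Map task's JDBC query; that mapping is an assumption here, as is the way the boundaries would be obtained from the database.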

Keywords

ETL, Hadoop, MapReduce, Big data, Range partition

Notes

Acknowledgements

This work is supported by the Core Electronic Devices, High-end Generic Chips and Basic Software of National Science and Technology Major Projects of China, No. 2013ZX01039002.

Copyright information

© Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2015

Authors and Affiliations

  • Lianchao Cao (1), email author
  • Zhanqiang Li (1)
  • Kaiyuan Qi (1)
  • Guomao Xin (1)
  • Dong Zhang (1)
  1. State Key Laboratory of High-end Server and Storage Technology, System Soft Department of Inspur, Jinan, China
