An Efficient Data Extracting Method Based on Hadoop

Cao, Lianchao; Li, Zhanqiang; Qi, Kaiyuan; Xin, Guomao; Zhang, Dong

doi:10.1007/978-3-319-16050-4_8

Part of the book series: Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (LNICST, volume 142)

Included in the following conference series:

International Conference on Cloud Computing

Abstract

As an open-source big data solution, the Hadoop ecosystem has been widely accepted and applied. However, importing large amounts of data from a traditional relational database into Hadoop in a short time has become a major challenge for the ETL (Extract-Transform-Load) stage of big data processing. This paper presents an efficient parallel data extraction method based on Hadoop, which uses the MapReduce computation engine to call the JDBC (Java Database Connectivity) interface for data extraction. To address the problem of splitting the input among multiple Map tasks, the paper presents a dynamic segmentation algorithm for Map input based on range partitioning, which effectively avoids data skew and distributes the input data more uniformly across the Map tasks. Experimental results show that, compared with the ETL tool Sqoop, which uses the same MapReduce computation engine, the proposed method divides the input data more uniformly and takes less time to extract the same data.
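The abstract does not spell out the segmentation algorithm, so the following is only a minimal sketch of the general idea behind range-based splitting and data skew, not the authors' method. It contrasts equal-width key ranges (one boundary every `(max - min) / num_maps`) with boundaries chosen from quantiles of the key values, so that each Map task receives roughly the same number of rows. The function names, the sample key column, and the quantile approach are all assumptions made for illustration.

```python
from bisect import bisect_right


def uniform_range_splits(lo, hi, num_maps):
    """Boundaries that cut [lo, hi] into num_maps equal-width key ranges."""
    width = (hi - lo) / num_maps
    return [lo + width * i for i in range(1, num_maps)]


def quantile_splits(sample_keys, num_maps):
    """Boundaries chosen so each range holds about the same number of keys.

    Illustrative assumption: the keys (or a sample of them) are available
    up front; this is not taken from the paper.
    """
    keys = sorted(sample_keys)
    n = len(keys)
    return [keys[(n * i) // num_maps] for i in range(1, num_maps)]


def rows_per_map(keys, boundaries):
    """Count how many keys fall into each range defined by the boundaries."""
    counts = [0] * (len(boundaries) + 1)
    for k in keys:
        counts[bisect_right(boundaries, k)] += 1
    return counts


# Hypothetical skewed key column: 90 dense ids plus 10 sparse, large ids.
keys = list(range(1, 91)) + [1000 * i for i in range(1, 11)]

uniform = rows_per_map(keys, uniform_range_splits(min(keys), max(keys), 4))
balanced = rows_per_map(keys, quantile_splits(keys, 4))

print("equal-width ranges:", uniform)   # [92, 3, 2, 3]
print("quantile ranges:   ", balanced)  # [25, 25, 25, 25]
```

With equal-width ranges, one Map task would read 92 of the 100 rows, which is the kind of skew the abstract describes; boundaries that follow the actual key distribution even out the load.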

Acknowledgements

This work is supported by the Core Electronic Devices, High-end Generic Chips and Basic Software of National Science and Technology Major Projects of China, No. 2013ZX01039002.

Author information

Authors and Affiliations

State Key Laboratory of High-end Server and Storage Technology, System Soft Department of Inspur, Jinan, 250101, China
Lianchao Cao, Zhanqiang Li, Kaiyuan Qi, Guomao Xin & Dong Zhang

Corresponding author

Correspondence to Lianchao Cao.

Editor information

Editors and Affiliations

Electrical and Computer Engineering, The University of British Columbia, Vancouver, British Columbia, Canada
Victor C.M. Leung
Cofederal Networks Inc., Renton, Washington, USA
Roy Xiaorong Lai
School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
Min Chen
School of Mechanical and Automotive Engineering, South China University of Technology, Guangzhou, China
Jiafu Wan

About this paper

Cite this paper

Cao, L., Li, Z., Qi, K., Xin, G., Zhang, D. (2015). An Efficient Data Extracting Method Based on Hadoop. In: Leung, V., Lai, R., Chen, M., Wan, J. (eds) Cloud Computing. CloudComp 2014. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 142. Springer, Cham. https://doi.org/10.1007/978-3-319-16050-4_8

DOI: https://doi.org/10.1007/978-3-319-16050-4_8
Published: 08 March 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16049-8
Online ISBN: 978-3-319-16050-4
eBook Packages: Computer Science, Computer Science (R0)
