A Strategy for Small Files Processing in HDFS

Bao, Zhenshan; Xu, Shikun; Zhang, Wenbo; Chen, Juncheng; Liu, Jianli

doi:10.1007/978-981-10-2053-7_11

Conference paper
First Online: 31 July 2016

Part of the book series: Communications in Computer and Information Science (CCIS, volume 623)

Abstract

The Hadoop Distributed File System (HDFS) is a popular cloud storage platform, owing to its scalable, reliable, and low-cost storage capability. However, it is designed mainly for batch processing of large files, which means that small files cannot be handled efficiently by HDFS. In this paper, we propose a mechanism for storing small files in HDFS. In our approach, the file size is checked before a file is uploaded to HDFS. If the file is smaller than the block size, all correlated small files are merged into a single file, and an index is built for each small file. Furthermore, prefetching and caching mechanisms are used to improve the read efficiency of small files. New small files can be appended to an existing merged file. Experimental results show that, compared with the original HDFS, the storage efficiency of small files is improved.
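
The workflow summarized in the abstract (size check, merge, per-file index, prefetching and caching, append) can be illustrated with a small in-memory sketch. It is not the authors' implementation: the class and method names are hypothetical, a `bytearray` stands in for the merged file on HDFS, "correlated" files are approximated by their neighbours in merge order, and the LRU cache policy is an assumption, since the abstract does not specify the index layout, the correlation measure, or the cache policy.

```python
from collections import OrderedDict


class MergedSmallFileStore:
    """Toy model of merge + index + prefetch/cache (all names hypothetical)."""

    def __init__(self, block_size, cache_capacity=4, prefetch_count=1):
        self.block_size = block_size          # threshold for "small" files
        self.cache_capacity = cache_capacity  # max entries kept in the cache
        self.prefetch_count = prefetch_count  # neighbours loaded on a miss
        self.merged = bytearray()   # stands in for the single merged file
        self.index = {}             # name -> (offset, length) in merged
        self.order = []             # small-file names in merge order
        self.large_files = {}       # files stored individually
        self.cache = OrderedDict()  # LRU cache of small-file contents

    def put(self, name, data):
        """Store a file and report whether it was merged or stored directly."""
        if name in self.index or name in self.large_files:
            raise ValueError(f"{name} already stored")
        if len(data) >= self.block_size:
            # Not smaller than a block: store it on its own.
            self.large_files[name] = bytes(data)
            return "direct"
        # Small file: append to the merged file and record where it lives.
        self.index[name] = (len(self.merged), len(data))
        self.order.append(name)
        self.merged.extend(data)
        return "merged"

    def _read_from_merged(self, name):
        offset, length = self.index[name]
        return bytes(self.merged[offset:offset + length])

    def _cache_put(self, name, data):
        self.cache[name] = data
        self.cache.move_to_end(name)
        while len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)  # evict least recently used

    def get(self, name):
        """Read a file, serving small files from the cache when possible."""
        if name in self.large_files:
            return self.large_files[name]
        if name in self.cache:
            self.cache.move_to_end(name)
            return self.cache[name]
        data = self._read_from_merged(name)
        self._cache_put(name, data)
        # Prefetch the next files in merge order as a stand-in for
        # "correlated" files (an assumption of this sketch).
        pos = self.order.index(name)
        for other in self.order[pos + 1:pos + 1 + self.prefetch_count]:
            if other not in self.cache:
                self._cache_put(other, self._read_from_merged(other))
        return data
```

In this sketch, appending a new small file is just another `put` call: the data lands at the end of the merged buffer and only the index grows, so earlier offsets remain valid.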

Acknowledgement

This research was supported by the Beijing Key Laboratory on Integration and Analysis of Large Scale Stream Data (ID: PXM2015_014204_500221) and the significant special project for core electronic devices, high-end general chips, and basic software products (2012ZX01039-004).

Author information

Authors and Affiliations

College of Computer Science, Beijing University of Technology, Beijing, 100124, China
Zhenshan Bao, Shikun Xu, Wenbo Zhang, Juncheng Chen & Jianli Liu

Corresponding author

Correspondence to Zhenshan Bao.

Editor information

Editors and Affiliations

Harbin Institute of Technology, Harbin, China
Wanxiang Che
Harbin Engineering University, Harbin, China
Qilong Han
Harbin Institute of Technology, Harbin, China
Hongzhi Wang
Northeast Forestry University, Harbin, China
Weipeng Jing
National University of Defense Technology, Changsha, China
Shaoliang Peng
Harbin Engineering University, Harbin, China
Junyu Lin
Harbin Univ. of Science and Technology, Harbin, China
Guanglu Sun
Harbin Univ. of Science and Technology, Harbin, China
Xianhua Song
Harbin Engineering University, Harbin, China
Hongtao Song
Harbin Sea of Clouds & Computer Tech., Harbin, China
Zeguang Lu

About this paper

Cite this paper

Bao, Z., Xu, S., Zhang, W., Chen, J., Liu, J. (2016). A Strategy for Small Files Processing in HDFS. In: Che, W., et al. Social Computing. ICYCSEE 2016. Communications in Computer and Information Science, vol 623. Springer, Singapore. https://doi.org/10.1007/978-981-10-2053-7_11

DOI: https://doi.org/10.1007/978-981-10-2053-7_11
Published: 31 July 2016
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-2052-0
Online ISBN: 978-981-10-2053-7
eBook Packages: Computer Science, Computer Science (R0)
