Abstract
The Hadoop Distributed File System (HDFS) is a popular cloud storage platform, valued for its scalable, reliable and low-cost storage. However, it is designed mainly for batch processing of large files, which means that it cannot handle small files efficiently. In this paper, we propose a mechanism for storing small files in HDFS. In our approach, the size of each file is checked before it is uploaded to HDFS. If the file is smaller than the block size, it is merged with its correlated small files into a single file, and an index entry is built for each small file. Furthermore, prefetching and caching mechanisms are used to improve the read efficiency of small files. New small files can be appended to an existing merged file. Experimental results show that, compared with the original HDFS, the storage efficiency of small files is improved.
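The merge-and-index idea described in the abstract can be illustrated with a minimal in-memory sketch. This is not the authors' implementation: the class name `MergedFile`, its methods, and the use of a byte buffer in place of an HDFS file are all hypothetical, and the 128 MB block size is only an assumed default. Correlation-based grouping, prefetching and caching are left out.

```python
import io

# Assumed block size; in a real cluster this is a configuration value.
BLOCK_SIZE = 128 * 1024 * 1024


class MergedFile:
    """Hypothetical container: small files stored back to back, plus an index."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.data = io.BytesIO()   # stands in for the merged file on HDFS
        self.index = {}            # file name -> (offset, length)

    def is_small(self, payload):
        # The size check performed before upload: smaller than one block.
        return len(payload) < self.block_size

    def append(self, name, payload):
        # Only small files are merged; larger files would be stored directly.
        if not self.is_small(payload):
            raise ValueError("file is not smaller than the block size")
        offset = self.data.seek(0, io.SEEK_END)  # always write at the end
        self.data.write(payload)
        self.index[name] = (offset, len(payload))

    def read(self, name):
        # Look up the index entry, then read exactly that byte range.
        offset, length = self.index[name]
        self.data.seek(offset)
        return self.data.read(length)


merged = MergedFile(block_size=16)     # tiny block size for demonstration
merged.append("a.txt", b"hello")
merged.append("b.txt", b"world!")      # appended to the existing merged file
print(merged.read("b.txt"))
```

Because each index entry records only an offset and a length, appending a new small file never rewrites earlier content; it extends the merged file and adds one index entry.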
Acknowledgement
This research was supported by the Beijing Key Laboratory on Integration and Analysis of Large Scale Stream Data (ID: PXM2015_014204_500221) and the significant special project for core electronic devices, high-end general chips and basic software products (2012ZX01039-004).
© 2016 Springer Science+Business Media Singapore
About this paper
Cite this paper
Bao, Z., Xu, S., Zhang, W., Chen, J., Liu, J. (2016). A Strategy for Small Files Processing in HDFS. In: Che, W., et al. Social Computing. ICYCSEE 2016. Communications in Computer and Information Science, vol 623. Springer, Singapore. https://doi.org/10.1007/978-981-10-2053-7_11
DOI: https://doi.org/10.1007/978-981-10-2053-7_11
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-2052-0
Online ISBN: 978-981-10-2053-7
eBook Packages: Computer Science, Computer Science (R0)