Abstract
Storing and accessing massive numbers of small files is one of the challenges in the design of distributed file systems. The Hadoop distributed file system (HDFS) is primarily designed for reliable storage of and fast access to very large files, and its performance degrades as the number of small files increases. This paper proposes a middleware called Hmfs to improve the efficiency of storing and accessing small files on HDFS. It consists of three layers: file operation interfaces, which make it easier for software developers to submit different file requests; file management tasks, which merge small files into big ones or extract small files from big ones in the background; and file buffers, which improve I/O performance. Hmfs boosts file upload speed by using an asynchronous write mechanism and file download speed by adopting a prefetching and caching strategy. The experimental results show that Hmfs helps achieve high storage and access speeds for massive numbers of small files on HDFS.
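The merge and extract steps mentioned in the abstract can be illustrated with a minimal in-memory sketch. It is not the Hmfs implementation and does not touch HDFS: it assumes the simplest possible layout, in which small files are concatenated into one blob and an index maps each file name to its offset and length. The names `merge_small_files`, `extract_small_file`, and `Index` are hypothetical and chosen only for this illustration.

```python
import io
from typing import Dict, Tuple

# Hypothetical type for this sketch: file name -> (offset, length) in the merged blob.
Index = Dict[str, Tuple[int, int]]


def merge_small_files(files: Dict[str, bytes]) -> Tuple[bytes, Index]:
    """Concatenate small files into one blob and record where each one lives."""
    buffer = io.BytesIO()
    index: Index = {}
    for name, data in files.items():
        # The current write position is the offset of this file in the blob.
        index[name] = (buffer.tell(), len(data))
        buffer.write(data)
    return buffer.getvalue(), index


def extract_small_file(blob: bytes, index: Index, name: str) -> bytes:
    """Recover one small file from the merged blob using the index."""
    offset, length = index[name]
    return blob[offset:offset + length]


# Assumed sample data, for illustration only.
small_files = {"a.txt": b"alpha", "b.txt": b"bravo!", "c.txt": b""}
merged, index = merge_small_files(small_files)
```

With a layout like this, the file system stores one large object instead of many small ones, and a read only needs the index entry to locate the requested bytes inside it.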
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Yan, C., Li, T., Huang, Y., Gan, Y. (2014). Hmfs: Efficient Support of Small Files Processing over HDFS. In: Sun, Xh., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2014. Lecture Notes in Computer Science, vol 8631. Springer, Cham. https://doi.org/10.1007/978-3-319-11194-0_5
DOI: https://doi.org/10.1007/978-3-319-11194-0_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11193-3
Online ISBN: 978-3-319-11194-0
eBook Packages: Computer Science, Computer Science (R0)