mpCache: Accelerating MapReduce with Hybrid Storage System on Many-Core Clusters

Wang, Bo; Jiang, Jinlei; Yang, Guangwen

doi:10.1007/978-3-662-44917-2_19

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8707)

Included in the following conference series:

IFIP International Conference on Network and Parallel Computing

Abstract

As a widely used programming model and implementation for processing large data sets, MapReduce does not scale well on many-core clusters, which, unfortunately, are common in current data centers. To deal with the problem, this paper: 1) analyzes the causes of the poor scalability of MapReduce on many-core clusters and identifies the key one, namely that the underlying low-speed storage (hard disk) cannot meet the requirements of frequent I/O operations; and 2) proposes mpCache, an SSD-based hybrid storage system that caches both Input Data and Localized Data and dynamically tunes the cache space allocation between them to make full use of the space. mpCache has been incorporated into Hadoop and evaluated on a 7-node cluster with 13 benchmarks. The experimental results show that mpCache achieves an average speedup of 2.09 over the original Hadoop and an average speedup of 1.79 over PACMan, the latest in-memory optimization of MapReduce.
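
The idea of one cache space shared by two kinds of data, with the split between them adjusted at run time, can be illustrated with a small sketch. The abstract does not describe mpCache's actual tuning policy, so everything below is an assumption made for illustration: the class name `SplitCache`, the LRU eviction, the block-count quotas, and the rule of shifting quota toward whichever partition missed more often.

```python
from collections import OrderedDict


class SplitCache:
    """Toy cache whose capacity is shared by two classes of data.

    Hypothetical illustration only: the names, the LRU eviction and the
    miss-count retuning rule are assumptions, not mpCache's design.
    """

    KINDS = ("input", "localized")

    def __init__(self, capacity, step=1):
        self.step = step
        # Start from an even split of the shared capacity (counted in blocks).
        self.quota = {"input": capacity // 2,
                      "localized": capacity - capacity // 2}
        self.blocks = {kind: OrderedDict() for kind in self.KINDS}
        self.misses = {kind: 0 for kind in self.KINDS}

    def access(self, kind, key):
        """Return True on a hit, False on a miss (the block is then cached)."""
        part = self.blocks[kind]
        if key in part:
            part.move_to_end(key)  # mark as most recently used
            return True
        self.misses[kind] += 1
        part[key] = True
        self._evict(kind)
        return False

    def _evict(self, kind):
        # Drop least recently used blocks until the partition fits its quota.
        part = self.blocks[kind]
        while len(part) > self.quota[kind]:
            part.popitem(last=False)

    def retune(self):
        """Move up to `step` blocks of quota toward the partition that missed more."""
        a, b = self.KINDS
        if self.misses[a] != self.misses[b]:
            gainer, loser = (a, b) if self.misses[a] > self.misses[b] else (b, a)
            moved = min(self.step, self.quota[loser])
            self.quota[loser] -= moved
            self.quota[gainer] += moved
            self._evict(loser)
        # Begin a fresh observation window.
        self.misses = {kind: 0 for kind in self.KINDS}


cache = SplitCache(capacity=4)
for key in ["a", "b", "c", "a", "b", "c"]:
    cache.access("input", key)  # three blocks cycling through a quota of two
cache.retune()
print(cache.quota)  # {'input': 3, 'localized': 1}
```

In this toy run the "input" partition thrashes while the "localized" partition sits idle, so one block of quota moves to "input". A real system would size quotas in bytes and would need a more careful signal than raw miss counts.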

References

Ahmad, F., Lee, S., Thottethodi, M., Vijaykumar, T.: Puma: Purdue mapreduce benchmarks suite (2012), http://web.ics.purdue.edu/~fahmad/benchmarks.htm
Ananthanarayanan, G., Ghodsi, A., Wang, A., Borthakur, D., Kandula, S., Shenker, S., Stoica, I.: Pacman: Coordinated memory caching for parallel jobs. In: Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, NSDI 2012, p. 20. USENIX (2012)
Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: Haloop: Efficient iterative data processing on large clusters. Proceedings of the VLDB Endowment 3(1-2), 285–296 (2010)
Chen, F., Koufaty, D.A., Zhang, X.: Hystor: making the best use of solid state drives in high performance storage systems. In: Proceedings of the International Conference on Supercomputing, ICS 2011, pp. 22–32. ACM (2011)
Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Communications of the ACM 51(1), 107–113 (2008)
Feeley, M.J., Morgan, W.E., Pighin, E., Karlin, A.R., Levy, H.M., Thekkath, C.A.: Implementing global memory management in a workstation cluster. ACM (1995)
Ghemawat, S., Gobioff, H., Leung, S.-T.: The google file system. ACM SIGOPS Operating Systems Review 37, 29–43 (2003)
Handy, J.: Flash memory vs. hard disk drives - which will win?, http://www.storagesearch.com/semico-art1.html
Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Operating Systems Review 41(3), 59–72 (2007)
Kim, Y., Gupta, A., Urgaonkar, B., Berman, P., Sivasubramaniam, A.: Hybridstore: A cost-efficient, high-performance storage system combining ssds and hdds. In: 2011 IEEE 19th International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems, MASCOTS 2011, pp. 227–236. IEEE (2011)
Knuth, D.E.: The art of computer programming, vol. 3. Addison-Wesley, Reading Mass. Pearson Education (2005)
Oh, Y., Choi, J., Lee, D., Noh, S.H.: Caching less for better performance: Balancing cache size and update cost of flash memory cache in hybrid storage systems. In: Proceedings of the 10th USENIX Conference on File and Storage Technologies, FAST 2012, p. 25. USENIX (2012)
Ousterhout, J., Agrawal, P., Erickson, D., Kozyrakis, C., Leverich, J., Mazières, D., Mitra, S., Narayanan, A., Parulkar, G., Rosenblum, M., et al.: The case for ramclouds: scalable high-performance storage entirely in dram. ACM SIGOPS Operating Systems Review 43(4), 92–105 (2010)
Pritchett, T., Thottethodi, M.: Sievestore: a highly-selective, ensemble-level disk cache for cost-performance. In: Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA 2010, pp. 163–174. ACM (2010)
Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., Kozyrakis, C.: Evaluating mapreduce for multi-core and multiprocessor systems. In: IEEE 13th International Symposium on High Performance Computer Architecture, HPCA 2007, pp. 13–24. IEEE (2007)
Schindler, J., Shete, S., Smith, K.A.: Improving throughput for small disk requests with proximal i/o. In: Proceedings of the 9th USENIX Conference on File and Storage Technologies, FAST 2011, pp. 133–147. USENIX (2011)
Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The hadoop distributed file system. In: 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies, MSST 2010, pp. 1–10. IEEE (2010)
Stuart, J.A., Owens, J.D.: Multi-gpu mapreduce on gpu clusters. In: 2011 IEEE International Parallel & Distributed Processing Symposium, IPDPS 2011, pp. 1068–1079. IEEE (2011)
Talbot, J., Yoo, R.M., Kozyrakis, C.: Phoenix++: modular mapreduce for shared-memory systems. In: Proceedings of the Second International Workshop on MapReduce and Its Applications, pp. 9–16. ACM (2011)
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment 2(2), 1626–1629 (2009)

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology (TNLIST), Tsinghua University, Beijing, 100084, China
Bo Wang, Jinlei Jiang & Guangwen Yang
Technology Innovation Center at Yinzhou, Yangtze Delta Region Institute of Tsinghua University, Zhejiang, 314006, China
Jinlei Jiang

Editor information

Editors and Affiliations

Chung Hua University, 707, Sec. 2, WuFu Rd., 30012, Hsinchu, Taiwan
Ching-Hsien Hsu
Huazhong University of Science and Technology, 1037#, Luoyu Road, 430074, Wuhan, China
Xuanhua Shi
IBM Thomas J. Watson Research Center, 1101 Kitchawan Rd., 10598, Yorktown Heights, NY, USA
Valentina Salapura

About this paper

Cite this paper

Wang, B., Jiang, J., Yang, G. (2014). mpCache: Accelerating MapReduce with Hybrid Storage System on Many-Core Clusters. In: Hsu, CH., Shi, X., Salapura, V. (eds) Network and Parallel Computing. NPC 2014. Lecture Notes in Computer Science, vol 8707. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44917-2_19

DOI: https://doi.org/10.1007/978-3-662-44917-2_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44916-5
Online ISBN: 978-3-662-44917-2
eBook Packages: Computer Science, Computer Science (R0)
