Abstract
As a widely used programming model and implementation for processing large data sets, MapReduce does not scale well on many-core clusters, which, unfortunately, are common in current data centers. To deal with the problem, this paper: 1) analyzes the causes of poor scalability of MapReduce on many-core clusters and identifies the key one as the underlying low-speed storage (hard disk) can not meet the requirements of frequent IO operations, and 2) proposes mpCache, a SSD based hybrid storage system that caches both Input Data and Localized Data, and dynamically tunes the cache space allocation between them to make full use of the space. mpCache has been incorporated into Hadoop and evaluated on a 7-node cluster by 13 benchmarks. The experimental results show that mpCache gains an average speedup of 2.09 when compared with the original Hadoop, and achieves an average speedup of 1.79 when compared with PACMan, the latest in-memory optimization of MapReduce.
Chapter PDF
Similar content being viewed by others
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
References
Ahmad, F., Lee, S., Thottethodi, M., Vijaykumar, T.: Puma: Purdue mapreduce benchmarks suite (2012), http://web.ics.purdue.edu/~fahmad/benchmarks.htm
Ananthanarayanan, G., Ghodsi, A., Wang, A., Borthakur, D., Kandula, S., Shenker, S., Stoica, I.: Pacman: Coordinated memory caching for parallel jobs. In: Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, NSDI 2012, p. 20. USENIX (2012)
Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: Haloop: Efficient iterative data processing on large clusters. Proceedings of the VLDB Endowment 3(1-2), 285–296 (2010)
Chen, F., Koufaty, D.A., Zhang, X.: Hystor: making the best use of solid state drives in high performance storage systems. In: Proceedings of the International Conference on Supercomputing, ICS 2011, pp. 22–32. ACM (2011)
Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Communications of the ACM 51(1), 107–113 (2008)
Feeley, M.J., Morgan, W.E., Pighin, E., Karlin, A.R., Levy, H.M., Thekkath, C.A.: Implementing global memory management in a workstation cluster. ACM (1995)
Ghemawat, S., Gobioff, H., Leung, S.-T.: The google file system. ACM SIGOPS Operating Systems Review 37, 29–43 (2003)
Handy, J.: Flash memory vs. hard disk drives - which will win?, http://www.storagesearch.com/semico-art1.html
Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Operating Systems Review 41(3), 59–72 (2007)
Kim, Y., Gupta, A., Urgaonkar, B., Berman, P., Sivasubramaniam, A.: Hybridstore: A cost-efficient, high-performance storage system combining ssds and hdds. In: 2011 IEEE 19th International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems, MASCOTS 2011, pp. 227–236. IEEE (2011)
Knuth, D.E.: The art of computer programming, vol. 3. Addison-Wesley, Reading Mass. Pearson Education (2005)
Oh, Y., Choi, J., Lee, D., Noh, S.H.: Caching less for better performance: Balancing cache size and update cost of flash memory cache in hybrid storage systems. In: Proceedings of the 10th USENIX Conference on File and Storage Technologies, FAST 2012, p. 25. USENIX (2012)
Ousterhout, J., Agrawal, P., Erickson, D., Kozyrakis, C., Leverich, J., Mazières, D., Mitra, S., Narayanan, A., Parulkar, G., Rosenblum, M., et al.: The case for ramclouds: scalable high-performance storage entirely in dram. ACM SIGOPS Operating Systems Review 43(4), 92–105 (2010)
Pritchett, T., Thottethodi, M.: Sievestore: a highly-selective, ensemble-level disk cache for cost-performance. In: Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA 2010, pp. 163–174. ACM (2010)
Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., Kozyrakis, C.: Evaluating mapreduce for multi-core and multiprocessor systems. In: IEEE 13th International Symposium on High Performance Computer Architecture, HPCA 2007, pp. 13–24. IEEE (2007)
Schindler, J., Shete, S., Smith, K.A.: Improving throughput for small disk requests with proximal i/o. In: Proceedings of the 9th USENIX Conference on File and Storage Technologies, FAST 2011, pp. 133–147. USENIX (2011)
Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The hadoop distributed file system. In: 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies, MSST 2010, pp. 1–10. IEEE (2010)
Stuart, J.A., Owens, J.D.: Multi-gpu mapreduce on gpu clusters. In: 2011 IEEE International Parallel & Distributed Processing Symposium, IPDPS 2011, pp. 1068–1079. IEEE (2011)
Talbot, J., Yoo, R.M., Kozyrakis, C.: Phoenix++: modular mapreduce for shared-memory systems. In: Proceedings of the Second International Workshop on MapReduce and Its Applications, pp. 9–16. ACM (2011)
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment 2(2), 1626–1629 (2009)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 IFIP International Federation for Information Processing
About this paper
Cite this paper
Wang, B., Jiang, J., Yang, G. (2014). mpCache: Accelerating MapReduce with Hybrid Storage System on Many-Core Clusters. In: Hsu, CH., Shi, X., Salapura, V. (eds) Network and Parallel Computing. NPC 2014. Lecture Notes in Computer Science, vol 8707. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44917-2_19
Download citation
DOI: https://doi.org/10.1007/978-3-662-44917-2_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44916-5
Online ISBN: 978-3-662-44917-2
eBook Packages: Computer ScienceComputer Science (R0)