Abstract
MapReduce has proven remarkably effective for a wide variety of data-intensive applications, but it was designed to run on large single-site homogeneous clusters. Researchers have begun to explore the extent to which the original MapReduce assumptions can be relaxed including skewed workloads, iterative applications, and heterogeneous computing environments. This chapter continues this exploration by applying MapReduce across widely distributed data over distributed computation resources. This problem arises when datasets are generated and stored at multiple sites as is common in many scientific domains and increasingly e-commerce applications. It also occurs when multi-site resources such as geographically separated data centers are applied to the same application. Using Hadoop, we show that the absence of network and node homogeneity and locality of data lead to poor performance. The problem is that interaction of MapReduce phases becomes pronounced in the presence of heterogeneous network behavior. In this paper, we propose new cross-phase optimization techniques that enable independent MapReduce phases to influence one another. We propose techniques that optimize the push and map phases to enable push-map overlap and to allow map behavior to feed back into push dynamics. Similarly, we propose techniques that optimize the map and reduce phases to enable shuffle cost to feed back and affect map scheduling decisions. We evaluate the benefits of our techniques in both Amazon EC2 and PlanetLab. The experimental results show the potential of these techniques as performance is improved from 7 to 18 % depending on the execution environment and application.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
- 2.
When this is not possible, tasks must read their inputs remotely, and scarce network bandwidth becomes a limiting factor.
- 3.
This and the remaining figures and tables from this chapter are reproduced from [23].
- 4.
To improve fault tolerance, we have also added an option to cache and replicate inputs at the compute nodes. This reduces the need to re-fetch remote data after task failures or for speculative execution.
- 5.
Throughout this chapter, error bars indicate 95 % confidence intervals.
References
Cloudera impala. Http://www.cloudera.com/impala
Scalding. Http://github.com/twitter/scalding
A conversation with Jim Gray. Queue 1(4), 8–17 (2003). DOI 10.1145/864056.864078. URL http://doi.acm.org/10.1145/864056.864078
Ahmad, F., Chakradhar, S., Raghunathan, A., Vijaykumar, T.N.: Tarazu: Optimizing MapReduce on heterogeneous clusters. In: Proceedings of ASPLOS, pp. 61–74 (2012)
Ananthanarayanan, G., Kandula, S., Greenberg, A., Stoica, I., Lu, Y., Saha, B.: Reining in the outliers in map-reduce clusters using mantri. In: Proceedings of OSDI, pp. 265–278 (2010)
Arlitt, M., Jin, T.: Workload characterization of the 1998 World Cup Web Site. Tech. Rep. HPL-1999-35R1, HP Labs (1999)
Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R.H., Konwinsky, A., Lee, G., Patterson, D.A., Rabkin, A., Stoica, I., Zaharia, M.: Above the clouds: A Berkeley view of cloud computing. Tech. Rep. UCB/EECS-2009-28, Electrical Engineering and Computer Sciences, University of California at Berkeley (2009)
Babu, S.: Towards automatic optimization of MapReduce programs. In: Proceedings of ACM SoCC, pp. 137–142 (2010)
Baker, J., et al.: Megastore: Providing scalable, highly available storage for interactive services. In: Proceedings of CIDR, pp. 223–234 (2011)
BitTorrent. Http://www.bittorrent.com
Cardosa, M., Wang, C., Nangia, A., Chandra, A., Weissman, J.: Exploring MapReduce efficiency with highly-distributed data. In: Proceedings of MapReduce, pp. 27–33 (2011)
Chen, Q., Zhang, D., Guo, M., Deng, Q., Guo, S., Guo, S.: SAMR: A self-adaptive MapReduce scheduling algorithm in heterogeneous environment. In: Proceedings of IEEE CIT, pp. 2736–2743 (2010)
Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Elmeleegy, K., Sears, R.: MapReduce online. In: Proceedings of NSDI, pp. 313–327 (2010)
Corbett, J.C., et al.: Spanner: Google’s globally-distributed database. In: Proceedings of OSDI, pp. 251–264 (2012)
Costa, F., Silva, L., Dahlin, M.: Volunteer cloud computing: MapReduce over the internet. In: Proceedings of IEEE IPDPSW, pp. 1855–1862 (2011)
Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: Proceedings of OSDI, pp. 137–150 (2004)
Gadre, H., Rodero, I., Parashar, M.: Investigating MapReduce framework extensions for efficient processing of geographically scattered datasets. In: Proceedings of ACM SIGMETRICS, pp. 116–118 (2011)
Gates, A.F., Natkovich, O., Chopra, S., Kamath, P., Narayanamurthy, S.M., Olston, C., Reed, B., Srinivasan, S., Srivastava, U.: Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings of the VLDB Endowment 2(2), 1414–1425 (2009). URL http://dl.acm.org/citation.cfm?id=1687553.1687568
Free eBooks by Project Gutenberg. http://www.gutenberg.org/
Hadoop. http://hadoop.apache.org
Heintz, B., Chandra, A., Sitaraman, R.K.: Optimizing MapReduce for highly distributed environments. Tech. Rep. TR 12-003, Department of Computer Science and Engineering, University of Minnesota (2012)
Heintz, B., Wang, C., Chandra, A., Weissman, J.: Cross-phase optimization in MapReduce. In: IEEE International Conference on Cloud Engineering (IC2E), pp. 338–347 (2013)
Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F.B., Babu, S.: Starfish: A self-tuning system for big data analytics. In: Proceedings of CIDR, pp. 261–272 (2011)
Hey, T., Tansley, S., Tolle, K. (eds.): The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, Washington (2009). URL http://research.microsoft.com/en-us/collaboration/fourthparadigm/
Kim, S., Won, J., Han, H., Eom, H., Yeom, H.Y.: Improving Hadoop performance in intercloud environments. Proceedings of ACM SIGMETRICS 39(3), 107–109 (2011)
Lin, H., Ma, X., Archuleta, J., Feng, W.c., Gardner, M., Zhang, Z.: MOON: MapReduce on opportunistic environments. In: Proceedings of ACM HPDC, pp. 95–106 (2010)
Luo, Y., Guo, Z., Sun, Y., Plale, B., Qiu, J., Li, W.W.: A hierarchical framework for cross-domain MapReduce execution. In: Proceedings of ECMLS, pp. 15–22 (2011)
Hadoop. http://mahout.apache.org
Nygren, E., Sitaraman, R., Sun, J.: The Akamai network: A platform for high-performance internet applications. ACM SIGOPS Oper. Syst. Rev. 44(3), 2–19 (2010)
Palanisamy, B., Singh, A., Liu, L., Jain, B.: Purlieus: locality-aware resource allocation for MapReduce in a cloud. In: Proceedings of SC, pp. 58:1–58:11 (2011)
Rabkin, A., Arye, M., Sen, S., Pai, V., Freedman, M.J.: Making every bit count in wide-area analytics. In: Proceedings of the 14th Workshop on Hot Topics in Operating Systems (2013)
Sandholm, T., Lai, K.: MapReduce optimization using dynamic regulated prioritization. In: Proceedings of ACM SIGMETRICS, pp. 299–310 (2009)
Satyanarayanan, M., Bahl, P., Caceres, R., Davies, N.: The case for VM-based cloudlets in mobile computing. IEEE Pervasive Computing 8, 14–23 (2009)
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment 2, 1626–1629 (2009)
Verma, A., Zea, N., Cho, B., Gupta, I., Campbell, R.H.: Breaking the MapReduce stage barrier. In: Proceedings of IEEE Cluster, pp. 235–244 (2010)
Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R.H., Stoica, I.: Improving MapReduce performance in heterogeneous environments. In: Proceedings of OSDI, pp. 29–42 (2008)
Acknowledgements
The authors would like to acknowledge Professor Ramesh Sitaraman (ramesh@cs.umass.edu) for his contributions to our model-driven optimization approach (Sect. 2), as well as Chenyu Wang (chwang@cs.umn.edu) for his contributions to the cross-phase optimization techniques. They would further like to acknowledge NSF Grants IIS-0916425 and CNS-0643505, which supported this research.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 Springer Science+Business Media New York
About this chapter
Cite this chapter
Heintz, B., Chandra, A., Weissman, J. (2014). Cross-Phase Optimization in MapReduce. In: Li, X., Qiu, J. (eds) Cloud Computing for Data-Intensive Applications. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1905-5_12
Download citation
DOI: https://doi.org/10.1007/978-1-4939-1905-5_12
Published:
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4939-1904-8
Online ISBN: 978-1-4939-1905-5
eBook Packages: Computer ScienceComputer Science (R0)