Cross-Phase Optimization in MapReduce

Heintz, Benjamin; Chandra, Abhishek; Weissman, Jon

doi:10.1007/978-1-4939-1905-5_12

Benjamin Heintz³,
Abhishek Chandra³ &
Jon Weissman³

1596 Accesses
4 Citations

Abstract

MapReduce has proven remarkably effective for a wide variety of data-intensive applications, but it was designed to run on large single-site homogeneous clusters. Researchers have begun to explore the extent to which the original MapReduce assumptions can be relaxed including skewed workloads, iterative applications, and heterogeneous computing environments. This chapter continues this exploration by applying MapReduce across widely distributed data over distributed computation resources. This problem arises when datasets are generated and stored at multiple sites as is common in many scientific domains and increasingly e-commerce applications. It also occurs when multi-site resources such as geographically separated data centers are applied to the same application. Using Hadoop, we show that the absence of network and node homogeneity and locality of data lead to poor performance. The problem is that interaction of MapReduce phases becomes pronounced in the presence of heterogeneous network behavior. In this paper, we propose new cross-phase optimization techniques that enable independent MapReduce phases to influence one another. We propose techniques that optimize the push and map phases to enable push-map overlap and to allow map behavior to feed back into push dynamics. Similarly, we propose techniques that optimize the map and reduce phases to enable shuffle cost to feed back and affect map scheduling decisions. We evaluate the benefits of our techniques in both Amazon EC2 and PlanetLab. The experimental results show the potential of these techniques as performance is improved from 7 to 18 % depending on the execution environment and application.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

eBook: USD 16.99; Price excludes VAT (USA)

Softcover Book: USD 179.00; Price excludes VAT (USA)

Hardcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
http://newsroom.fb.com.
2.
When this is not possible, tasks must read their inputs remotely, and scarce network bandwidth becomes a limiting factor.
3.
This and the remaining figures and tables from this chapter are reproduced from [23].
4.
To improve fault tolerance, we have also added an option to cache and replicate inputs at the compute nodes. This reduces the need to re-fetch remote data after task failures or for speculative execution.
5.
Throughout this chapter, error bars indicate 95 % confidence intervals.

References

Cloudera impala. Http://www.cloudera.com/impala
Scalding. Http://github.com/twitter/scalding
A conversation with Jim Gray. Queue 1(4), 8–17 (2003). DOI 10.1145/864056.864078. URL http://doi.acm.org/10.1145/864056.864078
Ahmad, F., Chakradhar, S., Raghunathan, A., Vijaykumar, T.N.: Tarazu: Optimizing MapReduce on heterogeneous clusters. In: Proceedings of ASPLOS, pp. 61–74 (2012)
Google Scholar
Ananthanarayanan, G., Kandula, S., Greenberg, A., Stoica, I., Lu, Y., Saha, B.: Reining in the outliers in map-reduce clusters using mantri. In: Proceedings of OSDI, pp. 265–278 (2010)
Google Scholar
Arlitt, M., Jin, T.: Workload characterization of the 1998 World Cup Web Site. Tech. Rep. HPL-1999-35R1, HP Labs (1999)
Google Scholar
Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R.H., Konwinsky, A., Lee, G., Patterson, D.A., Rabkin, A., Stoica, I., Zaharia, M.: Above the clouds: A Berkeley view of cloud computing. Tech. Rep. UCB/EECS-2009-28, Electrical Engineering and Computer Sciences, University of California at Berkeley (2009)
Google Scholar
Babu, S.: Towards automatic optimization of MapReduce programs. In: Proceedings of ACM SoCC, pp. 137–142 (2010)
Google Scholar
Baker, J., et al.: Megastore: Providing scalable, highly available storage for interactive services. In: Proceedings of CIDR, pp. 223–234 (2011)
Google Scholar
BitTorrent. Http://www.bittorrent.com
Cardosa, M., Wang, C., Nangia, A., Chandra, A., Weissman, J.: Exploring MapReduce efficiency with highly-distributed data. In: Proceedings of MapReduce, pp. 27–33 (2011)
Google Scholar
Chen, Q., Zhang, D., Guo, M., Deng, Q., Guo, S., Guo, S.: SAMR: A self-adaptive MapReduce scheduling algorithm in heterogeneous environment. In: Proceedings of IEEE CIT, pp. 2736–2743 (2010)
Google Scholar
Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Elmeleegy, K., Sears, R.: MapReduce online. In: Proceedings of NSDI, pp. 313–327 (2010)
Google Scholar
Corbett, J.C., et al.: Spanner: Google’s globally-distributed database. In: Proceedings of OSDI, pp. 251–264 (2012)
Google Scholar
Costa, F., Silva, L., Dahlin, M.: Volunteer cloud computing: MapReduce over the internet. In: Proceedings of IEEE IPDPSW, pp. 1855–1862 (2011)
Google Scholar
Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: Proceedings of OSDI, pp. 137–150 (2004)
Google Scholar
Gadre, H., Rodero, I., Parashar, M.: Investigating MapReduce framework extensions for efficient processing of geographically scattered datasets. In: Proceedings of ACM SIGMETRICS, pp. 116–118 (2011)
Google Scholar
Gates, A.F., Natkovich, O., Chopra, S., Kamath, P., Narayanamurthy, S.M., Olston, C., Reed, B., Srinivasan, S., Srivastava, U.: Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings of the VLDB Endowment 2(2), 1414–1425 (2009). URL http://dl.acm.org/citation.cfm?id=1687553.1687568
GridFTP. Http://globus.org/toolkit/docs/3.2/gridftp/
Free eBooks by Project Gutenberg. http://www.gutenberg.org/
Hadoop. http://hadoop.apache.org
Heintz, B., Chandra, A., Sitaraman, R.K.: Optimizing MapReduce for highly distributed environments. Tech. Rep. TR 12-003, Department of Computer Science and Engineering, University of Minnesota (2012)
Google Scholar
Heintz, B., Wang, C., Chandra, A., Weissman, J.: Cross-phase optimization in MapReduce. In: IEEE International Conference on Cloud Engineering (IC2E), pp. 338–347 (2013)
Google Scholar
Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F.B., Babu, S.: Starfish: A self-tuning system for big data analytics. In: Proceedings of CIDR, pp. 261–272 (2011)
Google Scholar
Hey, T., Tansley, S., Tolle, K. (eds.): The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, Washington (2009). URL http://research.microsoft.com/en-us/collaboration/fourthparadigm/
Kim, S., Won, J., Han, H., Eom, H., Yeom, H.Y.: Improving Hadoop performance in intercloud environments. Proceedings of ACM SIGMETRICS 39(3), 107–109 (2011)
Article Google Scholar
Lin, H., Ma, X., Archuleta, J., Feng, W.c., Gardner, M., Zhang, Z.: MOON: MapReduce on opportunistic environments. In: Proceedings of ACM HPDC, pp. 95–106 (2010)
Google Scholar
Luo, Y., Guo, Z., Sun, Y., Plale, B., Qiu, J., Li, W.W.: A hierarchical framework for cross-domain MapReduce execution. In: Proceedings of ECMLS, pp. 15–22 (2011)
Google Scholar
Hadoop. http://mahout.apache.org
Nygren, E., Sitaraman, R., Sun, J.: The Akamai network: A platform for high-performance internet applications. ACM SIGOPS Oper. Syst. Rev. 44(3), 2–19 (2010)
Article Google Scholar
Palanisamy, B., Singh, A., Liu, L., Jain, B.: Purlieus: locality-aware resource allocation for MapReduce in a cloud. In: Proceedings of SC, pp. 58:1–58:11 (2011)
Google Scholar
Rabkin, A., Arye, M., Sen, S., Pai, V., Freedman, M.J.: Making every bit count in wide-area analytics. In: Proceedings of the 14th Workshop on Hot Topics in Operating Systems (2013)
Google Scholar
Sandholm, T., Lai, K.: MapReduce optimization using dynamic regulated prioritization. In: Proceedings of ACM SIGMETRICS, pp. 299–310 (2009)
Google Scholar
Satyanarayanan, M., Bahl, P., Caceres, R., Davies, N.: The case for VM-based cloudlets in mobile computing. IEEE Pervasive Computing 8, 14–23 (2009)
Article Google Scholar
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment 2, 1626–1629 (2009)
Article Google Scholar
Verma, A., Zea, N., Cho, B., Gupta, I., Campbell, R.H.: Breaking the MapReduce stage barrier. In: Proceedings of IEEE Cluster, pp. 235–244 (2010)
Google Scholar
Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R.H., Stoica, I.: Improving MapReduce performance in heterogeneous environments. In: Proceedings of OSDI, pp. 29–42 (2008)
Google Scholar

Download references

Acknowledgements

The authors would like to acknowledge Professor Ramesh Sitaraman (ramesh@cs.umass.edu) for his contributions to our model-driven optimization approach (Sect. 2), as well as Chenyu Wang (chwang@cs.umn.edu) for his contributions to the cross-phase optimization techniques. They would further like to acknowledge NSF Grants IIS-0916425 and CNS-0643505, which supported this research.

Author information

Authors and Affiliations

University of Minnesota, Keller Hall 4-192, 200 Union Street SE, Minneapolis, MN, 55455, USA
Benjamin Heintz, Abhishek Chandra & Jon Weissman

Authors

Benjamin Heintz
View author publications
You can also search for this author in PubMed Google Scholar
Abhishek Chandra
View author publications
You can also search for this author in PubMed Google Scholar
Jon Weissman
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Benjamin Heintz .

Editor information

Editors and Affiliations

University of Florida, Gainesville, Florida, USA
Xiaolin Li
Indiana University, Bloomington, Indiana, USA
Judy Qiu

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Heintz, B., Chandra, A., Weissman, J. (2014). Cross-Phase Optimization in MapReduce. In: Li, X., Qiu, J. (eds) Cloud Computing for Data-Intensive Applications. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1905-5_12

Download citation

DOI: https://doi.org/10.1007/978-1-4939-1905-5_12
Published: 15 November 2014
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4939-1904-8
Online ISBN: 978-1-4939-1905-5
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics