An Eye on the Elephant in the Wild: A Performance Evaluation of Hadoop’s Schedulers Under Failures

Ibrahim, Shadi; Phuong, Tran Anh; Antoniu, Gabriel

doi:10.1007/978-3-319-28448-4_11

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9438)

Included in the following conference series:

International Workshop on Adaptive Resource Management and Scheduling for Cloud Computing

Abstract

Large-scale data analysis increasingly relies on MapReduce and its open-source implementation, Hadoop. Hadoop is no longer used only to run single batch jobs; it has also been optimized to support the simultaneous execution of multiple jobs belonging to multiple concurrent users. Several schedulers (i.e., the Fifo, Fair, and Capacity schedulers) have been proposed to optimize the locality of task executions, but they do not consider failures, although evidence in the literature shows that faults do occur and can result in performance problems.

In this paper, we design a set of experiments to evaluate the performance of Hadoop under failures when applying several schedulers (i.e., we explore the conflict between job scheduling, the locality of task executions, and failures). Our results reveal several drawbacks of Hadoop’s current mechanism for prioritizing failed tasks. By trying to launch failed tasks as soon as possible regardless of locality, this mechanism significantly increases the execution time of jobs with failed tasks, for two reasons: (1) available resources might not be freed up as quickly as expected, and (2) failed tasks might be re-executed on machines that do not hold their input data, introducing an extra cost for transferring data through the network, which is normally the scarcest resource in today’s data centers. Our preliminary study with Hadoop not only helps us understand the interplay between fault tolerance and job scheduling, but also offers useful insights into optimizing the current schedulers to be more efficient in case of failures.
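The second reason can be illustrated with a toy model. The sketch below is not Hadoop code and does not reproduce the experiments of this paper; the `Task` class, the node names, the 64 MB input size, and both placement functions are hypothetical and chosen only for illustration. It contrasts relaunching a failed task on the first free node with relaunching it on a free node that already holds a replica of its input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: str
    replica_nodes: frozenset  # nodes holding a replica of the task's input (hypothetical)
    input_mb: int             # input size in MB (illustrative value, not a measurement)


def pick_node_eager(task, free_nodes):
    """Relaunch on the first free node, ignoring data locality."""
    return free_nodes[0]


def pick_node_locality_aware(task, free_nodes):
    """Prefer a free node that holds the input; otherwise use the first free node."""
    for node in free_nodes:
        if node in task.replica_nodes:
            return node
    return free_nodes[0]


def network_transfer_mb(task, node):
    """Input read over the network: zero when the node already holds a replica."""
    return 0 if node in task.replica_nodes else task.input_mb


# A failed map task whose input is replicated on node3 and node5 (assumed layout).
failed = Task("map_0007", frozenset({"node3", "node5"}), input_mb=64)
free_nodes = ["node1", "node3"]  # nodes with a free slot when the task is relaunched

eager = pick_node_eager(failed, free_nodes)
local = pick_node_locality_aware(failed, free_nodes)
print(eager, network_transfer_mb(failed, eager))  # node1 64
print(local, network_transfer_mb(failed, local))  # node3 0
```

In this toy setting the eager choice pays for a full remote read of the input, while the locality-aware choice pays nothing. The model ignores the first reason entirely (how quickly slots are actually freed), which any real scheduler change would also have to account for.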

This work was done while Tran Anh Phuong was an intern at Inria Rennes.

Notes

1. https://issues.apache.org/jira/browse/HADOOP-3412

Acknowledgments

Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see http://www.grid5000.fr/).

Author information

Authors and Affiliations

Inria Rennes - Bretagne Atlantique Research Center, Campus Universitaires de Beaulieu, 35042, Rennes, France
Shadi Ibrahim, Tran Anh Phuong & Gabriel Antoniu

Corresponding author

Correspondence to Shadi Ibrahim.

Editor information

Editors and Affiliations

University Politehnica of Bucharest, Bucharest, Romania
Florin Pop
Université Pierre et Marie CURIE, Paris, France
Maria Potop-Butucaru

About this paper

Cite this paper

Ibrahim, S., Phuong, T.A., Antoniu, G. (2015). An Eye on the Elephant in the Wild: A Performance Evaluation of Hadoop’s Schedulers Under Failures. In: Pop, F., Potop-Butucaru, M. (eds) Adaptive Resource Management and Scheduling for Cloud Computing. ARMS-CC 2015. Lecture Notes in Computer Science, vol 9438. Springer, Cham. https://doi.org/10.1007/978-3-319-28448-4_11

DOI: https://doi.org/10.1007/978-3-319-28448-4_11
Published: 08 January 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28447-7
Online ISBN: 978-3-319-28448-4
eBook Packages: Computer Science, Computer Science (R0)
