Choosing Optimal Maintenance Time for Stateless Data-Processing Clusters

Zhuang, Zhenyun; Shen, Min; Ramachandra, Haricharan; Viswesan, Suja

doi:10.1007/978-3-319-61756-5_14

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10353)

Abstract

Stateless clusters such as Hadoop clusters are widely deployed to drive business data analysis. When a cluster must be restarted for cluster-wide maintenance, administrators want to choose a maintenance window that results in (1) the least disturbance to cluster operation and (2) maximized job-processing throughput. A straightforward but naive approach is to choose the maintenance time with the fewest running jobs, but such an approach is suboptimal.

In this work, we use Hadoop as a use case and propose to determine the optimal cluster maintenance time based on accumulated job progress, as opposed to the number of running jobs. The approach can maximize the job throughput of a stateless cluster by minimizing the amount of work lost due to maintenance. Compared to the straightforward approach, the proposed approach can save up to 50% of the cluster resources wasted by maintenance, according to production cluster traces.
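
The contrast between the two selection criteria can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Job` record, the `containers` weight, the toy trace, and the candidate times are all hypothetical, and the sketch assumes that a job killed by maintenance loses all work done since it started (no state is persisted), with lost work measured as elapsed time multiplied by the containers held.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Job:
    start: int       # minute at which the job started
    end: int         # minute at which it would finish if undisturbed
    containers: int  # hypothetical resource weight (containers held)


def running_at(jobs: Iterable[Job], t: int) -> List[Job]:
    """Jobs that are in flight at minute t."""
    return [j for j in jobs if j.start <= t < j.end]


def running_count(jobs: Iterable[Job], t: int) -> int:
    """Cost used by the naive approach: number of running jobs."""
    return len(running_at(jobs, t))


def lost_work(jobs: Iterable[Job], t: int) -> int:
    """Cost based on accumulated progress: container-minutes that are
    thrown away if every running job is killed at minute t (assumed model)."""
    return sum((t - j.start) * j.containers for j in running_at(jobs, t))


def best_time(candidates: Iterable[int], cost: Callable[[int], int]) -> int:
    """Candidate time with the lowest cost (first one wins on ties)."""
    return min(candidates, key=cost)


# Toy trace: one long, wide job and two short, narrow jobs.
jobs = [
    Job(start=0, end=100, containers=10),
    Job(start=3, end=9, containers=1),
    Job(start=4, end=10, containers=1),
]
candidates = [5, 50, 90]

by_count = best_time(candidates, lambda t: running_count(jobs, t))
by_progress = best_time(candidates, lambda t: lost_work(jobs, t))
```

In this toy trace the count-based criterion prefers a time at which only the long job is running, even though that job has already accumulated a large amount of work, whereas the progress-based criterion prefers the early time at which three jobs are running but little work has been done.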

Notes

1. The earlier version, Hadoop 1, does not have a Resource Manager.
2. There are other frameworks, such as those based on Spark, but they are not gaining significant popularity at this time.
3. Due to performance concerns, the total JVM heap size allocated to all containers on a Data Node should not exceed the physical RAM size of the node.
4. Efforts are under way to allow job state persistence [9]; however, they face challenges of implementation complexity, usability, and adoption cost.
5. The problem considered does not change even with a non-fixed maintenance duration, but this assumption simplifies the presentation.
6. The shuffling and reducing phases may overlap, so for simplicity we define the reducing time as the maximum of the reducing time and the shuffling time reported by the job history server.
7. The particular Hadoop cluster we studied shows consistent day-to-day and week-to-week patterns.
8. Those clusters that have a sufficiently large number of data nodes and run heterogeneous workloads.
9. In production, the profiling efforts run continuously.

References

1. Apache Hadoop. https://hadoop.apache.org/
2. HDFS Architecture Guide. https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Ghemawat, S., Gobioff, H., Leung, S.-T.: The Google file system. In: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP 2003), Bolton Landing, pp. 29–43 (2003)
4. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for structured data. ACM Trans. Comput. Syst. 26(2), 4:1–4:26 (2008)
5. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation (OSDI 2004), San Francisco, vol. 6, p. 10 (2004)
6. Apache HBase. http://hbase.apache.org/
7. Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., Baldeschwieler, E.: Apache Hadoop YARN: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC 2013) (2013)
8. Apache Spark. http://spark.apache.org/
9. Provide ability to persist running jobs. https://issues.apache.org/jira/browse/HADOOP-3245
10. Box, G.E.P., Jenkins, G.M.: Time Series Analysis: Forecasting and Control, 3rd edn. Prentice Hall PTR, Upper Saddle River (1994)
11. Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., Baldeschwieler, E.: Apache Hadoop YARN: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC 2013), Santa Clara (2013)
12. Joshi, S.B.: Apache Hadoop performance-tuning methodologies and best practices. In: Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering (ICPE 2012), Boston (2012)
13. Gu, R., Yang, X., Yan, J., Sun, Y., Wang, B., Yuan, C., Huang, Y.: SHadoop: improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters. J. Parallel Distrib. Comput. 74(3), 2166–2179 (2014)
14. Wu, D., Luo, W., Xie, W., Ji, X., He, J., Wu, D.: Understanding the impacts of solid-state storage on the Hadoop performance. In: Proceedings of the 2013 International Conference on Advanced Cloud and Big Data (CBD 2013), Washington, DC (2013)
15. Sharma, B., Wood, T., Das, C.R.: HybridMR: a hierarchical MapReduce scheduler for hybrid data centers. In: Proceedings of the 2013 IEEE 33rd International Conference on Distributed Computing Systems (ICDCS 2013), Washington, DC, pp. 102–111 (2013)
16. Durbin, J., Koopman, S.J.: Time Series Analysis by State Space Methods. Oxford University Press, Oxford (2001)
17. Papagiannaki, K., Taft, N., Zhang, Z.-L., Diot, C.: Long-term forecasting of internet backbone traffic. Trans. Neural Netw. 16(5), 1110–1124 (2005)
18. Zhuang, Z., Ramachandra, H., Tran, C., Subramaniam, S., Botev, C., Xiong, C., Sridharan, B.: Capacity planning and headroom analysis for taming database replication latency: experiences with Linkedin internet traffic. In: Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering (ICPE 2015), Austin, pp. 39–50 (2015)
19. Qiao, L., Surlaker, K., Das, S., et al.: On brewing fresh espresso: Linkedin’s distributed data serving platform. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD 2013), pp. 1135–1146 (2013)
20. Das, S., Botev, C., et al.: All aboard the databus! Linkedin’s scalable consistent change data capture platform (SoCC 2012), New York (2012)
21. Xia, T., Jin, X., Xi, L., Zhang, Y., Ni, J.: Operating load based real-time rolling grey forecasting for machine health prognosis in dynamic maintenance schedule. J. Intell. Manuf. 26(2), 269–280 (2015)
22. Herbst, N.R., Huber, N., Kounev, S., Amrehn, E.: Self-adaptive workload classification and forecasting for proactive resource provisioning. In: Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering (ICPE 2013), pp. 187–198. ACM, New York (2013). http://doi.acm.org/10.1145/2479871.2479899
23. Moghaddam, K.S., Usher, J.S.: Preventive maintenance and replacement scheduling for repairable and maintainable systems using dynamic programming. Comput. Ind. Eng. 60(4), 654–665 (2011)
24. Carr, M., Wagner, C.: A study of reasoning processes in software maintenance management. Inf. Technol. Manag. 3(1–2), 181–203 (2002)

Author information

Authors and Affiliations

LinkedIn Corporation, 2029 Stierlin Court, Mountain View, CA, 94043, USA
Zhenyun Zhuang, Min Shen, Haricharan Ramachandra & Suja Viswesan

Corresponding author

Correspondence to Zhenyun Zhuang.

Editor information

Editors and Affiliations

Google, Seattle, USA
Narayan Desai
Google, Mountain View, USA
Walfredo Cirne

About this paper

Cite this paper

Zhuang, Z., Shen, M., Ramachandra, H., Viswesan, S. (2017). Choosing Optimal Maintenance Time for Stateless Data-Processing Clusters. In: Desai, N., Cirne, W. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2015, JSSPP 2016. Lecture Notes in Computer Science, vol 10353. Springer, Cham. https://doi.org/10.1007/978-3-319-61756-5_14

DOI: https://doi.org/10.1007/978-3-319-61756-5_14
Published: 12 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-61755-8
Online ISBN: 978-3-319-61756-5
eBook Packages: Computer Science, Computer Science (R0)
