Choosing Optimal Maintenance Time for Stateless Data-Processing Clusters

A Case Study of Hadoop Cluster

  • Conference paper
  • In: Job Scheduling Strategies for Parallel Processing (JSSPP 2015, JSSPP 2016)

Abstract

Stateless clusters such as Hadoop clusters are widely deployed to drive business data analysis. When a cluster needs to be restarted for cluster-wide maintenance, administrators want to choose a maintenance window that results in: (1) the least disturbance to cluster operation; and (2) maximized job-processing throughput. A straightforward but naive approach is to choose the maintenance time with the fewest running jobs, but such an approach is suboptimal.

In this work, we use Hadoop as a use case and propose to determine the optimal cluster maintenance time based on accumulated job progress, as opposed to the number of running jobs. This approach maximizes the job throughput of a stateless cluster by minimizing the amount of work lost due to maintenance. According to production cluster traces, the proposed approach can save up to 50% of the cluster resources wasted by maintenance compared to the straightforward approach.
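The core idea of the abstract can be illustrated with a small sketch (all jobs, times, and progress values below are hypothetical, not data from the paper): because a stateless cluster discards all in-flight work on restart, the cost of a maintenance window is the accumulated progress of the jobs it kills, not their count.

```python
# Hypothetical sketch of the paper's idea: pick the maintenance time that
# minimizes accumulated job progress lost, rather than the number of
# running jobs. All jobs and times below are made up for illustration.

def running_jobs(jobs, t):
    """Number of jobs running at time t (the naive metric)."""
    return sum(1 for start, duration in jobs if start <= t < start + duration)

def lost_work(jobs, t):
    """Accumulated progress (job-hours) discarded if the stateless
    cluster restarts at time t (the proposed metric)."""
    return sum(t - start for start, duration in jobs
               if start <= t < start + duration)

# Each job: (start time, duration), in hours.
jobs = [(0, 6), (0, 6),                          # two long jobs started at t=0
        (6, 2), (6, 2), (6, 2), (6, 2), (6, 2)]  # five short jobs started at t=6

candidates = [5, 7]
naive = min(candidates, key=lambda t: running_jobs(jobs, t))     # t=5: only 2 jobs running
proposed = min(candidates, key=lambda t: lost_work(jobs, t))     # t=7: only 5 job-hours lost
```

Here the naive rule picks t=5 because only two jobs are running, yet a restart then destroys 10 job-hours of progress; the progress-based rule picks t=7, killing more jobs but only 5 job-hours of accumulated work.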

Notes

  1. The previous version, Hadoop 1, does not have a Resource Manager.

  2. There are other frameworks, such as Spark-based ones, but they are not gaining significant popularity at this time.

  3. Due to performance concerns, the total JVM heap size allocated to all containers on a Data Node should not exceed the physical RAM size of the node.

  4. Efforts are underway to allow job state persistence [9], but they face challenges of implementation complexity, usability, and adoption cost.

  5. The problem considered does not change with a non-fixed maintenance duration; the fixed-duration assumption simply simplifies the presentation.

  6. The shuffling and reducing phases may overlap, so for simplicity we define the reducing time as the maximum of the reducing time and the shuffling time reported by the job history server.

  7. The Hadoop cluster we studied shows consistent day-to-day and week-to-week patterns.

  8. Clusters that have a sufficiently large number of data nodes and run heterogeneous workloads.

  9. In production, profiling runs continuously.

References

  1. Apache Hadoop. https://hadoop.apache.org/

  2. HDFS Architecture Guide. https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

  3. Ghemawat, S., Gobioff, H., Leung, S.-T.: The Google file system. In: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP 2003), Bolton Landing, pp. 29–43 (2003)

  4. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: a distributed storage system for structured data. ACM Trans. Comput. Syst. 26(2), 4:1–4:26 (2008)

  5. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation (OSDI 2004), San Francisco, vol. 6, p. 10 (2004)

  6. Apache HBase. http://hbase.apache.org/

  7. Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., Baldeschwieler, E.: Apache Hadoop YARN: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC 2013), Santa Clara (2013)

  8. Apache Spark. http://spark.apache.org/

  9. Provide ability to persist running jobs. https://issues.apache.org/jira/browse/HADOOP-3245

  10. Box, G.E.P., Jenkins, G.M.: Time Series Analysis: Forecasting and Control, 3rd edn. Prentice Hall PTR, Upper Saddle River (1994)

  11. Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., Baldeschwieler, E.: Apache Hadoop YARN: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC 2013), Santa Clara (2013)

  12. Joshi, S.B.: Apache Hadoop performance-tuning methodologies and best practices. In: Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering (ICPE 2012), Boston (2012)

  13. Gu, R., Yang, X., Yan, J., Sun, Y., Wang, B., Yuan, C., Huang, Y.: SHadoop: improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters. J. Parallel Distrib. Comput. 74(3), 2166–2179 (2014)

  14. Wu, D., Luo, W., Xie, W., Ji, X., He, J., Wu, D.: Understanding the impacts of solid-state storage on the Hadoop performance. In: Proceedings of the 2013 International Conference on Advanced Cloud and Big Data (CBD 2013), Washington, DC (2013)

  15. Sharma, B., Wood, T., Das, C.R.: HybridMR: a hierarchical MapReduce scheduler for hybrid data centers. In: Proceedings of the 2013 IEEE 33rd International Conference on Distributed Computing Systems (ICDCS 2013), Washington, DC, pp. 102–111 (2013)

  16. Durbin, J., Koopman, S.J.: Time Series Analysis by State Space Methods. Oxford University Press, Oxford (2001)

  17. Papagiannaki, K., Taft, N., Zhang, Z.-L., Diot, C.: Long-term forecasting of internet backbone traffic. Trans. Neural Netw. 16(5), 1110–1124 (2005)

  18. Zhuang, Z., Ramachandra, H., Tran, C., Subramaniam, S., Botev, C., Xiong, C., Sridharan, B.: Capacity planning and headroom analysis for taming database replication latency: experiences with Linkedin internet traffic. In: Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering (ICPE 2015), Austin, pp. 39–50 (2015)

  19. Qiao, L., Surlaker, K., Das, S., et al.: On brewing fresh espresso: Linkedin’s distributed data serving platform. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD 2013), pp. 1135–1146 (2013)

  20. Das, S., Botev, C., et al.: All aboard the databus! Linkedin’s scalable consistent change data capture platform. In: Proceedings of the Third ACM Symposium on Cloud Computing (SoCC 2012), New York (2012)

  21. Xia, T., Jin, X., Xi, L., Zhang, Y., Ni, J.: Operating load based real-time rolling grey forecasting for machine health prognosis in dynamic maintenance schedule. J. Intell. Manuf. 26(2), 269–280 (2015)

  22. Herbst, N.R., Huber, N., Kounev, S., Amrehn, E.: Self-adaptive workload classification and forecasting for proactive resource provisioning. In: Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering (ICPE 2013), pp. 187–198. ACM, New York (2013). http://doi.acm.org/10.1145/2479871.2479899

  23. Moghaddam, K.S., Usher, J.S.: Preventive maintenance and replacement scheduling for repairable and maintainable systems using dynamic programming. Comput. Ind. Eng. 60(4), 654–665 (2011)

  24. Carr, M., Wagner, C.: A study of reasoning processes in software maintenance management. Inf. Technol. Manag. 3(1–2), 181–203 (2002)

Author information

Correspondence to Zhenyun Zhuang.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Zhuang, Z., Shen, M., Ramachandra, H., Viswesan, S. (2017). Choosing Optimal Maintenance Time for Stateless Data-Processing Clusters. In: Desai, N., Cirne, W. (eds) Job Scheduling Strategies for Parallel Processing (JSSPP 2015, JSSPP 2016). Lecture Notes in Computer Science, vol 10353. Springer, Cham. https://doi.org/10.1007/978-3-319-61756-5_14

  • DOI: https://doi.org/10.1007/978-3-319-61756-5_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-61755-8

  • Online ISBN: 978-3-319-61756-5

  • eBook Packages: Computer Science (R0)
