Optimization of Data Center Fault Tolerance Design

  • Chapter
Engineering and Management of Data Centers

Abstract

Balancing the cost and quality of IT services is a challenging task for data center providers. For availability, fault tolerance can be achieved by introducing redundancy mechanisms into the service design. Redundancy allocation problems can be formulated as combinatorial optimization problems that identify cost-effective redundancy configurations meeting the availability objectives. However, such approaches should be flexible enough to trade off effort and benefit in a specific scenario. This chapter therefore proposes a redundancy allocation problem that can model the specific characteristics of the IT system under analysis. To identify suitable design configurations, a generic Petri net simulation model is combined with a genetic algorithm. Because the solution algorithm adapts to the complexity of the problem definition, users can reduce both modeling and computational effort. The suitability of the approach is demonstrated in a use case of an international application service provider.
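
To make the idea of a redundancy allocation problem concrete, the following is a minimal sketch and not the chapter's method: instead of a Petri net simulation and a genetic algorithm, it evaluates availability with a closed-form series-parallel formula and searches by exhaustive enumeration. It assumes independent, identical components within each subsystem; the function names, the `max_redundancy` bound, and all numbers are hypothetical.

```python
from itertools import product
from math import prod


def system_availability(avail, counts):
    """Availability of a series system of parallel subsystems.

    Assumes independent components: a subsystem with n identical
    components of availability a is up unless all n are down.
    """
    return prod(1 - (1 - a) ** n for a, n in zip(avail, counts))


def cheapest_configuration(avail, cost, target, max_redundancy=4):
    """Return (counts, total_cost) of the cheapest configuration that
    reaches the availability target, or None if none exists.

    Exhaustive enumeration; only practical for small instances.
    """
    best = None
    for counts in product(range(1, max_redundancy + 1), repeat=len(avail)):
        if system_availability(avail, counts) < target:
            continue  # availability objective not met
        total = sum(c * n for c, n in zip(cost, counts))
        if best is None or total < best[1]:
            best = (counts, total)
    return best


if __name__ == "__main__":
    # Hypothetical component availabilities and unit costs.
    avail = [0.99, 0.95, 0.98]
    cost = [10, 4, 6]
    print(cheapest_configuration(avail, cost, target=0.999))
```

Enumeration grows exponentially with the number of subsystems, which is one reason heuristic search such as a genetic algorithm is commonly used for larger instances, and simulation replaces the closed-form formula when the independence assumption does not hold.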

Corresponding author

Correspondence to Sascha Bosse.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Bosse, S., Turowski, K. (2017). Optimization of Data Center Fault Tolerance Design. In: Marx Gómez, J., Mora, M., Raisinghani, M., Nebel, W., O'Connor, R. (eds) Engineering and Management of Data Centers. Service Science: Research and Innovations in the Service Economy. Springer, Cham. https://doi.org/10.1007/978-3-319-65082-1_7

  • DOI: https://doi.org/10.1007/978-3-319-65082-1_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-65081-4

  • Online ISBN: 978-3-319-65082-1

  • eBook Packages: Computer Science (R0)
