Optimization of Data Center Fault Tolerance Design

Part of the Service Science: Research and Innovations in the Service Economy book series (SSRI)


Balancing the cost and quality of offered IT services is a challenging task for data center providers. In the case of availability, fault tolerance can be achieved by introducing redundancy mechanisms into the service design. Redundancy allocation problems can be defined as combinatorial optimization problems that identify cost-effective redundancy configurations in which availability objectives are met. However, these approaches should be flexible enough to trade off effort against benefit in a specific scenario. Therefore, this chapter proposes a redundancy allocation problem that is capable of modeling the specific characteristics of the IT system to be analyzed. To identify suitable design configurations, a generic Petri net simulation model is combined with a genetic algorithm. Because the solution algorithm adapts to the complexity of the considered problem definition, users are able to reduce both modeling and computational effort. The suitability of the approach is demonstrated in the use case of an international application service provider.
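To make the notion of a redundancy allocation problem concrete, the following is a minimal sketch and not the chapter's method. It assumes a series system of subsystems with identical active-parallel units, uses a closed-form availability formula in place of the Petri net simulation, and uses exhaustive search in place of the genetic algorithm. All component values and names (`SUBSYSTEMS`, `MAX_UNITS`, `cheapest_configuration`) are hypothetical.

```python
from itertools import product

# Hypothetical component data: (availability of one unit, cost of one unit).
SUBSYSTEMS = [
    (0.99, 4.0),  # e.g. an application server
    (0.95, 2.0),  # e.g. a database node
    (0.98, 1.0),  # e.g. a network switch
]
MAX_UNITS = 4  # assumed upper bound on parallel units per subsystem


def system_availability(config):
    """Availability of subsystems in series, each with n identical parallel units."""
    total = 1.0
    for (unit_availability, _), n in zip(SUBSYSTEMS, config):
        # A subsystem fails only if all n of its units fail.
        total *= 1.0 - (1.0 - unit_availability) ** n
    return total


def total_cost(config):
    """Sum of unit costs over all allocated units."""
    return sum(cost * n for (_, cost), n in zip(SUBSYSTEMS, config))


def cheapest_configuration(target):
    """Search all configurations for the cheapest one meeting the target.

    Returns None if no configuration within MAX_UNITS reaches the target.
    """
    best = None
    for config in product(range(1, MAX_UNITS + 1), repeat=len(SUBSYSTEMS)):
        if system_availability(config) >= target:
            if best is None or total_cost(config) < total_cost(best):
                best = config
    return best
```

Exhaustive search is only feasible for a handful of subsystems, since the number of configurations grows exponentially; that growth is one reason heuristics such as genetic algorithms are commonly applied to this class of problem.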


Availability management; Redundancy allocation; Fault tolerance; Design optimization



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Magdeburg Research and Competence Cluster for Very Large Business Applications, Faculty of Computer Science, Otto-von-Guericke University Magdeburg, Magdeburg, Germany
