Multi-Fault Tolerance for Cartesian Data Distributions

  • Nawab Ali
  • Sriram Krishnamoorthy
  • Mahantesh Halappanavar
  • Jeff Daily


Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance is a promising approach that involves modifications to the algorithm to recover from faults with lower overheads than replicated storage and a significant reduction in lost work compared to checkpoint-restart techniques. Fault-tolerant linear algebra algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple, simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) with the failures of each data block being independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in several important ways. First, we handle parity computation for generalized Cartesian data distributions with each processor holding arbitrary subsets of blocks in a Cartesian-distributed array. Second, techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together, are presented. Third, we handle the colocation of parity blocks with the data blocks and do not require them to be on additional processors. Several alternative approaches, based on graph matching, are presented that attempt to balance the memory overhead on processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. Evaluation of these algorithms demonstrates that the additional desirable properties are provided by the proposed approach with minimal overhead.


Fault tolerance Fault tolerant linear algebra Checksums Data distribution 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Ali, N., Carns, P.H., Iskra, K., Kimpe, D., Lang, S., Latham, R., Ross, R.B., Ward, L., Sadayappan, P.: Scalable I/O forwarding framework for high-performance computing systems. In: IEEE International Conference on Cluster Computing, pp. 1–10, Aug (2009)Google Scholar
  2. 2.
    Ali, N., Krishnamoorthy, S., Govind, N., Kowalski, K., Sadayappan, P.: Application-specific fault tolerance via data access characterization. In International European Conference on Parallel and Distributed Computing, Aug (2011a)Google Scholar
  3. 3.
    Ali, N., Krishnamoorthy, S., Govind, N., Palmer, B.: A redundant communication approachq to scalable fault tolerance in PGAS programming models. In: 19th Euromicro International Conference on Parallel, Distributed and Network-Based Computing, Ayia Napa, Cyprus, Feb (2011b)Google Scholar
  4. 4.
    Ali, N., Krishnamoorthy, S., Halappanavar, M., Daily, J.: Tolerating correlated failures for generalized cartesian distributions via bipartite matching. In: ACM International Conference on Computing Frontiers, May (2011c)Google Scholar
  5. 5.
    Bosilca G., Delmas R., Dongarra J., Langou J.: Algorithm-based fault tolerance applied to high performance computing. J. Parallel Distrib. Comput. 69(4), 410–416 (2009)CrossRefGoogle Scholar
  6. 6.
    Bronevetsky, G., Moody, A.: Scalable I/O systems via node-local storage: approaching 1 TB/sec file I/O. Technical report LLNL-TR-415791, Lawrence Livermore National Laboratory, Aug (2009)Google Scholar
  7. 7.
    Burkard R., Dell’Amico M., Martello S.: Assignment Problems. Society for Industrial and Applied Mathematics, Philadelphia (2009)MATHCrossRefGoogle Scholar
  8. 8.
    Chen, Z., Dongarra, J.: Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources. In: IEEE International Parallel & Distributed Processing Symposium, Apr (2006)Google Scholar
  9. 9.
    Chen Z., Dongarra J.: Algorithm-based fault tolerance for fail-stop failures. IEEE Trans. Parallel Distrib. Syst. 19(12), 1628–1641 (2008)CrossRefGoogle Scholar
  10. 10.
    Costa, P., Pasin, M., Bessani, A., Correia, M.: Byzantine fault-tolerant mapreduce: faults are not just crashes. In: IEEE International Conference on Cloud Computing Technology and Science, pp. 32–39 (2011)Google Scholar
  11. 11.
    Darte A., Mellor-Crummey J., Fowler R., Chavarría-Miranda D.: Generalized multipartitioning of multi-dimensional arrays for parallelizing line-sweep computations. J. Parallel Distrib. Comput. 63(9), 887–911 (2003)MATHCrossRefGoogle Scholar
  12. 12.
    Dean, J., Ghemawat S.: MapReduce: simplified data processing on large clusters. In: USENIX Symposium on Operating Systems Design and Implementation, pp. 137–150 (2004)Google Scholar
  13. 13.
    Elnozahy E.N., Alvisi L., Wang Y.-M., Johnson D.B.: A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34(3), 375–408 (2002)CrossRefGoogle Scholar
  14. 14.
    Engelmann, C., Vallée, G., Naughton, T., Scott, S.L.: Proactive fault tolerance using preemptive migration. In: International Conference on Parallel, Distributed and Network-based Processing, pp. 252–257, Feb (2009)Google Scholar
  15. 15.
    Fagg, G.E., Dongarra, J.: FT-MPI: fault tolerant MPI, supporting dynamic applications in a dynamic world. In: Proceedings of the 7th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp. 346–353 (2000)Google Scholar
  16. 16.
    Gabow H.N.: An efficient implementation of edmonds’ algorithm for maximum matching on graphs. J. ACM 23(2), 221–234 (1976)MathSciNetMATHCrossRefGoogle Scholar
  17. 17.
    Gupta, R., Beckman, P., Park, B.-H., Lusk, E., Hargrove, P., Geist, A., Panda, D., Lumsdaine, A., Dongarra, J.: CIFTS: a coordinated infrastructure for fault-tolerant systems. In: Proceedings of the International Conference on Parallel Processing, pp. 237–245 (2009)Google Scholar
  18. 18.
    Halappanavar, M.: Algorithms for vertex-weighted matching in graphs. PhD thesis, Old Dominion University, Norfolk, VA (2009)Google Scholar
  19. 19.
    Hargrove P.H., Duell J.C.: Berkeley lab checkpoint/restart (BLCR) for Linux clusters. J. Phys. Conf. Ser. 46(1), 494–499 (2006)CrossRefGoogle Scholar
  20. 20.
    Hopcroft J., Karp R.: A \({n^{\frac{5}{2}}}\) algorithm for maximum matchings in bipartite graphs. SIAM J. Comput. 2, 225–231 (1973)MathSciNetMATHCrossRefGoogle Scholar
  21. 21.
    Huang K.-H., Abraham J.A.: Algorithm-based fault tolerance for matrix operations. IEEE Trans. Comput. 33(6), 518–528 (1984)MATHCrossRefGoogle Scholar
  22. 22.
  23. 23.
    Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, pp. 59–72 (2007)Google Scholar
  24. 24.
    Kuhn H.W.: The Hungarian method for the assignment problem. Naval Res. Logist. Q. 2, 83–97 (1955)CrossRefGoogle Scholar
  25. 25.
    Lawler E.: Combinatorial Optimization: Networks and Matroids. Dover Publications, Mineola (2001)MATHGoogle Scholar
  26. 26.
    Lovasz L., Plummer M.D.: Matching Theory. North-Holland Publishing Co., Amsterdam (1986)MATHGoogle Scholar
  27. 27.
    Motwani R.: Average-case analysis of algorithms for matchings and related problems. J. ACM 41(6), 1329–1356 (1994)MathSciNetMATHCrossRefGoogle Scholar
  28. 28.
    Nieplocha J., Palmer B., Tipparaju V., Krishnan M., Trease H., Aprà à E.: Advances, applications and performance of the global arrays shared memory programming toolkit. Int. J. High Perform. Comput. Appl. 20, 203–231 (2006)CrossRefGoogle Scholar
  29. 29.
    Panda, D.K.: MVAPICH.
  30. 30.
    Papadimitriou C.H., Steiglitz K.: Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall Inc., Upper Saddle River (1982)MATHGoogle Scholar
  31. 31.
    Plank, J., Li, K.: Faster checkpointing with N + 1 parity. In: International Symposium on Fault-Tolerant Computing, pp. 288–297, June (1994)Google Scholar
  32. 32.
    Plank, J.S., Beck, M., Kingsley, G., Li, K.: Libckpt: transparent checkpointing under Unix. In: Usenix Winter Technical Conference, pp. 213–223, Jan (1995)Google Scholar
  33. 33.
    Plank J.S., Li K., Puening M.A.: Diskless checkpointing. IEEE Trans. Parallel Distrib. Syst. 9(10), 972–986 (1998)CrossRefGoogle Scholar
  34. 34.
    Schrijver A.: Combinatorial Optimization: Polyhedra and Efficiency. Springer Publishing Co., New York (2003)MATHGoogle Scholar
  35. 35.
    Schroeder B., Gibson G.A.: Understanding failures in petascale computers. J. Phys. Conf. Ser. 78(1), 1–11 (2007)Google Scholar
  36. 36.
    Tipparaju, V., Krishnan, M., Palmer, B., Petrini, F., Nieplocha, J.: Towards fault resilient global arrays. In: International Conference on Parallel Computing, vol. 15, pp. 339–345 (2007)Google Scholar
  37. 37.
    The ScaLAPACK project.
  38. 38.
    Valiev M., Bylaska E., Govind N., Kowalski K., Straatsma T., Dam H.V., Wang D., Nieplocha J., Apra E., Windus T., de Jong W.: NWChem: a comprehensive and scalable open-source solution for large scale molecular simulations. Comput. Phys. Commun. 181(9), 1477–1489 (2010)MATHCrossRefGoogle Scholar
  39. 39.
    Wang, C., Mueller, F., Engelmann, C., Scott, S.L.: Proactive process-level live migration in HPC environments. In: Proceedings of the ACM/IEEE Conference on Supercomputing, pp. 1–12, Nov (2008)Google Scholar
  40. 40.
    Wolsey L.A.: Integer Programming. Wiley, Hoboken (1998)MATHGoogle Scholar
  41. 41.
    Zheng, G., Shi, L., Kale, L.V.: FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for charm++ and MPI. In: IEEE International Conference on Cluster Computing, pp. 93–103, Sept (2004)Google Scholar

Copyright information

© Springer Science+Business Media New York 2012

Authors and Affiliations

  • Nawab Ali
    • 1
  • Sriram Krishnamoorthy
    • 1
  • Mahantesh Halappanavar
    • 1
  • Jeff Daily
    • 1
  1. 1.Pacific Northwest National LaboratoryRichlandUSA

Personalised recommendations