A Case for Epidemic Fault Detection and Group Membership in HPC Storage Systems

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8966)

Abstract

Fault response strategies are crucial to maintaining performance and availability in HPC storage systems, and the first responsibility of a successful fault response strategy is to detect failures and maintain an accurate view of group membership. This is a nontrivial problem given the unreliable nature of communication networks and other system components. As with many engineering problems, trade-offs must be made to account for the competing goals of fault detection efficiency and accuracy.

Today’s production HPC services typically rely on distributed consensus algorithms and heartbeat monitoring for group membership. In this work, we investigate epidemic protocols to determine whether they would be a viable alternative. Epidemic protocols have been proposed in previous work for use in peer-to-peer systems, but they have the potential to increase scalability and decrease fault response time for HPC systems as well. We focus our analysis on the Scalable Weakly-consistent Infection-style Process Group Membership (SWIM) protocol.
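
For context, the sketch below illustrates one SWIM failure-detection period as described by Das et al.: a member pings a randomly chosen peer, falls back to indirect ping-req probes through k other members if the direct ping is not acknowledged, and only then marks the peer as suspected. The member names, reachability model, and parameter values are hypothetical placeholders chosen for illustration; this is not the implementation evaluated in this work.

```python
import random

# Minimal sketch of one SWIM failure-detection period, following the protocol
# description in Das et al. (DSN 2002). Network delivery is modeled with a
# simple reachability set so the example is self-contained; all names and
# values below are illustrative only.

K_INDIRECT = 3                                  # members asked to ping indirectly
MEMBERS = ["n0", "n1", "n2", "n3", "n4", "n5"]  # hypothetical group
REACHABLE = {"n0", "n1", "n2", "n4", "n5"}      # n3 is assumed to have failed

def ping(target):
    """Direct probe: succeeds only if the target is reachable."""
    return target in REACHABLE

def ping_req(relay, target):
    """Indirect probe: a relay member pings the target on our behalf."""
    return relay in REACHABLE and target in REACHABLE

def swim_period(self_id, membership, suspects):
    """Run one protocol period from the point of view of `self_id`."""
    others = [m for m in membership if m != self_id]
    target = random.choice(others)

    if ping(target):
        return target, "alive"

    # Direct ping failed: ask k other members to probe the target for us,
    # which guards against loss or congestion on our own network path.
    relays = random.sample([m for m in others if m != target],
                           min(K_INDIRECT, len(others) - 1))
    if any(ping_req(r, target) for r in relays):
        return target, "alive"

    # No direct or indirect ack: mark the target suspected rather than
    # declaring it failed outright; suspicion is later disseminated by
    # piggybacking on subsequent ping/ack traffic (gossip).
    suspects.add(target)
    return target, "suspect"

if __name__ == "__main__":
    suspects = set()
    for _ in range(10):
        print(swim_period("n0", MEMBERS, suspects))
    print("suspected:", suspects)
```

The suspicion state, rather than an immediate failure declaration, is what allows the protocol to tolerate transient network problems: a suspected member that later answers a probe or refutes the suspicion is restored before the group removes it.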

We begin by exploring how the semantics of this protocol differ from those of typical HPC group membership protocols, and we discuss how storage systems might need to adapt as a result. We use existing analytical models to choose appropriate SWIM parameters for an HPC use case. We then develop a new, high-resolution parallel discrete event simulation of the protocol to confirm existing analytical models and explore protocol behavior that cannot be readily observed with analytical models. Our preliminary results indicate that the SWIM protocol is a promising alternative for group membership in HPC storage systems, offering rapid convergence, tolerance to transient network failures, and minimal network load.
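
As a rough illustration of how an analytical model can guide parameter selection, the snippet below applies the expected-detection-time relation from the SWIM analysis, in which the expected number of protocol periods until some non-faulty member first detects a failure is approximately 1/(1 − e^(−q_f)), where q_f is the fraction of non-faulty members. The target detection latency and q_f values used here are hypothetical, not the parameters chosen in this paper.

```python
import math

def expected_detection_periods(q_f: float) -> float:
    """Expected number of protocol periods until a failed member is first
    detected by some non-faulty member, per the SWIM analytical model
    (q_f is the fraction of members that are non-faulty)."""
    return 1.0 / (1.0 - math.exp(-q_f))

def protocol_period_for(target_latency_s: float, q_f: float = 1.0) -> float:
    """Largest protocol period (seconds) that keeps the *expected* time to
    first detection of a failure within target_latency_s."""
    return target_latency_s / expected_detection_periods(q_f)

if __name__ == "__main__":
    # Hypothetical HPC target: detect a failed storage server within ~5 s
    # on average, under varying assumptions about how healthy the group is.
    for q_f in (1.0, 0.95, 0.8):
        periods = expected_detection_periods(q_f)
        period_s = protocol_period_for(5.0, q_f)
        print(f"q_f={q_f:.2f}: ~{periods:.2f} periods, "
              f"protocol period <= {period_s:.2f} s")
```

Notably, this expected detection time depends on the protocol period and q_f but not on the group size, which is a large part of the protocol's appeal for scalable HPC storage deployments.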

Acknowledgments

This research was supported by the U.S. Department of Defense. This material is also based on work supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research program, under contract DE-AC02-06CH11357. This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is a DOE Office of Science User Facility.

Author information

Correspondence to Shane Snyder.


Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Snyder, S. et al. (2015). A Case for Epidemic Fault Detection and Group Membership in HPC Storage Systems. In: Jarvis, S., Wright, S., Hammond, S. (eds.) High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation. PMBS 2014. Lecture Notes in Computer Science, vol. 8966. Springer, Cham. https://doi.org/10.1007/978-3-319-17248-4_12

  • DOI: https://doi.org/10.1007/978-3-319-17248-4_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-17247-7

  • Online ISBN: 978-3-319-17248-4

  • eBook Packages: Computer Science; Computer Science (R0)
