Abstract
As the number of nodes in clusters grows, so does the probability of failures and unusual events. In this paper, we present checksum mechanisms to detect data corruption. We study the impact of checksums on network communication performance and propose a mechanism to amortize their cost on InfiniBand. We have implemented our mechanisms in the NewMadeleine communication library. Our evaluation shows that our mechanisms for ensuring message integrity do not noticeably impact application performance, which is an improvement over state-of-the-art MPI implementations.
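As a rough illustration of what checksum-based corruption detection looks like, the following minimal sketch computes a Fletcher-16 checksum over a message payload and compares it on the receiving side. This is not the mechanism implemented in NewMadeleine, and the abstract does not say which checksum algorithms the paper uses; Fletcher-16 is chosen here only because it is short and well known. The function names `fletcher16` and `message_is_intact` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Fletcher-16 over a byte buffer: two running sums modulo 255.
 * Illustrative choice only, not the paper's algorithm. */
static uint16_t fletcher16(const uint8_t *data, size_t len)
{
    uint32_t sum1 = 0;
    uint32_t sum2 = 0;
    for (size_t i = 0; i < len; i++) {
        sum1 = (sum1 + data[i]) % 255;  /* plain sum of bytes */
        sum2 = (sum2 + sum1) % 255;     /* position-sensitive sum */
    }
    return (uint16_t)((sum2 << 8) | sum1);
}

/* Hypothetical receiver-side check: recompute the checksum over the
 * received payload and compare it with the value sent alongside it.
 * Returns 1 if they match, 0 if corruption is detected. */
static int message_is_intact(const uint8_t *payload, size_t len,
                             uint16_t sent_checksum)
{
    return fletcher16(payload, len) == sent_checksum;
}
```

In this sketch the sender would compute `fletcher16` over the payload and transmit the result with the message; the receiver calls `message_is_intact` and can request a retransmission when it returns 0. The extra pass over the data is the overhead that the paper aims to amortize.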
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Denis, A., Trahay, F., Ishikawa, Y. (2012). High Performance Checksum Computation for Fault-Tolerant MPI over Infiniband. In: Träff, J.L., Benkner, S., Dongarra, J.J. (eds) Recent Advances in the Message Passing Interface. EuroMPI 2012. Lecture Notes in Computer Science, vol 7490. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33518-1_23
DOI: https://doi.org/10.1007/978-3-642-33518-1_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33517-4
Online ISBN: 978-3-642-33518-1
eBook Packages: Computer Science (R0)