High Performance Checksum Computation for Fault-Tolerant MPI over Infiniband

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 7490)

Abstract

As the number of nodes in clusters grows, so does the probability of failures and unusual events. In this paper, we present checksum mechanisms to detect data corruption. We study the impact of checksums on network communication performance and propose a mechanism to amortize their cost on InfiniBand. We have implemented our mechanisms in the NewMadeleine communication library. Our evaluation shows that these message-integrity mechanisms do not noticeably impact application performance, which is an improvement over state-of-the-art MPI implementations.
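
To make the idea of checksum-based corruption detection concrete, here is a minimal sketch in Python. It is not the paper's implementation and does not use the NewMadeleine API: the choice of the 32-bit FNV-1a hash, the 4-byte little-endian header, and the helper names `fnv1a_32`, `pack_message`, and `unpack_message` are all assumptions made for illustration. The sender prepends a checksum to the payload, and the receiver recomputes it and rejects the message on a mismatch.

```python
# Illustrative sketch only: hypothetical helpers, not NewMadeleine or MPI calls.

FNV_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """Compute the 32-bit FNV-1a hash of data."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte                          # mix in one input byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF   # multiply, keep the low 32 bits
    return h


def pack_message(payload: bytes) -> bytes:
    """Sender side: prepend the payload checksum as a 4-byte header."""
    return fnv1a_32(payload).to_bytes(4, "little") + payload


def unpack_message(packet: bytes) -> bytes:
    """Receiver side: verify the checksum and return the payload."""
    if len(packet) < 4:
        raise ValueError("packet too short to hold a checksum")
    expected = int.from_bytes(packet[:4], "little")
    payload = packet[4:]
    if fnv1a_32(payload) != expected:
        raise ValueError("checksum mismatch: payload corrupted")
    return payload


if __name__ == "__main__":
    packet = bytearray(pack_message(b"hello, cluster"))
    print(unpack_message(bytes(packet)))   # intact packet is accepted

    packet[7] ^= 0x01                      # simulate a single bit flip in transit
    try:
        unpack_message(bytes(packet))
    except ValueError as err:
        print("rejected:", err)
```

In a real communication library the checksum would typically be computed in native code, and its cost could plausibly be hidden by overlapping it with work the library already does on the data; this sketch shows only the detection logic.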

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Denis, A., Trahay, F., Ishikawa, Y. (2012). High Performance Checksum Computation for Fault-Tolerant MPI over Infiniband. In: Träff, J.L., Benkner, S., Dongarra, J.J. (eds) Recent Advances in the Message Passing Interface. EuroMPI 2012. Lecture Notes in Computer Science, vol 7490. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33518-1_23

  • DOI: https://doi.org/10.1007/978-3-642-33518-1_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33517-4

  • Online ISBN: 978-3-642-33518-1

  • eBook Packages: Computer Science, Computer Science (R0)
