
Scalable NIC Architecture to Support Offloading of Large Scale MPI Barrier

  • Conference paper
Advanced Parallel Processing Technologies (APPT 2013)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8299)

Abstract

MPI collective communication overhead dominates the communication cost on large-scale parallel computers, so the scalability and latency of collective operations are critical for next-generation machines. This paper proposes a fast and scalable barrier offload approach that supports millions of compute cores. In our approach, the host MPI driver packs the barrier operation sequence into a barrier "descriptor", which is pushed to the NIC (network interface). The NIC then completes the barrier automatically by following the algorithm in the descriptor. Our approach leverages an enhanced dissemination algorithm that is suited to current large-scale networks. We show that our approach achieves both good barrier performance and scalability, especially for large-scale computer systems. This paper also proposes an extensible and easy-to-implement NIC architecture that supports barrier offload as well as other communication patterns.
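
To make the dissemination idea concrete, the following is a minimal sketch of the classic dissemination barrier schedule (Hensgen, Finkel and Manber, 1988), not the enhanced variant proposed in the paper, whose details are not given in the abstract. In round k, each rank notifies the rank 2^k positions ahead of it and waits for the rank 2^k positions behind it, so the barrier finishes in ceil(log2 N) rounds. The names `Round`, `dissemination_schedule`, and `simulate_barrier` are hypothetical, and the per-round list is only an illustration of the kind of operation sequence a host driver might pack into a descriptor; the paper's actual descriptor format is an unknown here.

```python
from typing import List, NamedTuple, Set


class Round(NamedTuple):
    """One step of the schedule (hypothetical layout, not the paper's descriptor)."""
    send_to: int    # rank that receives this rank's notification
    wait_from: int  # rank whose notification must arrive before the next round


def dissemination_schedule(rank: int, nprocs: int) -> List[Round]:
    """Return the per-round peers of `rank` in a classic dissemination barrier."""
    if not 0 <= rank < nprocs:
        raise ValueError("rank must be in [0, nprocs)")
    rounds = []
    distance = 1
    while distance < nprocs:  # ceil(log2(nprocs)) iterations
        rounds.append(Round((rank + distance) % nprocs,
                            (rank - distance) % nprocs))
        distance *= 2
    return rounds


def simulate_barrier(nprocs: int) -> List[Set[int]]:
    """Track which ranks each rank has transitively heard from after all rounds."""
    known = [{r} for r in range(nprocs)]
    schedules = [dissemination_schedule(r, nprocs) for r in range(nprocs)]
    num_rounds = len(schedules[0]) if nprocs else 0
    for k in range(num_rounds):
        # Snapshot so that every rank sees its peer's state from the previous round.
        snapshot = [set(s) for s in known]
        for r in range(nprocs):
            known[r] |= snapshot[schedules[r][k].wait_from]
    return known
```

Calling `dissemination_schedule(0, 8)` yields three rounds with peers at distances 1, 2, and 4, and `simulate_barrier` confirms that after the last round every rank has heard, directly or transitively, from every other rank, which is the condition for leaving the barrier. The logarithmic round count is what makes the pattern attractive at the scale the abstract targets: one million ranks need only 20 rounds.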

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, S., Xu, W., Wu, D., Pang, Z., Lu, P. (2013). Scalable NIC Architecture to Support Offloading of Large Scale MPI Barrier. In: Wu, C., Cohen, A. (eds) Advanced Parallel Processing Technologies. APPT 2013. Lecture Notes in Computer Science, vol 8299. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45293-2_16

  • DOI: https://doi.org/10.1007/978-3-642-45293-2_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-45292-5

  • Online ISBN: 978-3-642-45293-2

  • eBook Packages: Computer Science, Computer Science (R0)
