The Impact of Global Communication Latency at Extreme Scales on Krylov Methods

  • Thomas J. Ashby
  • Pieter Ghysels
  • Wim Heirman
  • Wim Vanroose
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7439)


Krylov Subspace Methods (KSMs) are popular numerical tools for solving large linear systems of equations. We consider their role in solving sparse systems on future massively parallel distributed-memory machines by estimating the future performance of their constituent operations. To this end we construct a model that is simple but takes topology and network acceleration into account, as both are important considerations. We show that, as the number of nodes of a parallel machine grows very large, the increasing latency cost of reductions may well become a problematic bottleneck for traditional formulations of these methods. Finally, we discuss how pipelined KSMs can be used to tackle this potential problem, and what pipeline depths are appropriate.
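The scaling argument in the abstract can be sketched with a toy model. All constants below (per-hop latency, flop rate, nonzeros per row) are invented round numbers for illustration, not the paper's calibrated values: a tree-based reduction's latency grows with the logarithm of the node count, while under strong scaling the local work per iteration shrinks, so ever more iterations of local compute are needed to hide one global reduction.

```python
import math

# Illustrative-only parameters (assumptions, not taken from the paper):
HOP_LATENCY = 1e-6      # seconds per network hop
FLOP_RATE = 1e10        # flops per second per node
NNZ_PER_ROW = 50        # nonzeros per sparse matrix row

def reduction_latency(p, hop_latency=HOP_LATENCY):
    """Latency of a tree-based allreduce over p nodes: tree depth
    (and hence latency) grows logarithmically with machine size."""
    return 2 * math.ceil(math.log2(p)) * hop_latency  # reduce + broadcast

def local_compute_time(n_local, flop_rate=FLOP_RATE, nnz_per_row=NNZ_PER_ROW):
    """Time for one local sparse matrix-vector product on n_local rows."""
    return n_local * nnz_per_row / flop_rate

def pipeline_depth(p, n_local):
    """Iterations of local work needed to hide one global reduction,
    i.e. a rough lower bound on a useful KSM pipeline depth."""
    return math.ceil(reduction_latency(p) / local_compute_time(n_local))
```

With these invented numbers, a machine of 2^20 nodes holding only 1500 matrix rows per node would need a pipeline depth of about 6 to overlap one reduction with local work, whereas on a few thousand nodes with large local problems the depth is 1 and classical formulations suffice.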


Keywords: Krylov methods · extreme scaling · global communication · reduction latency · pipelining · latency hiding





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Thomas J. Ashby 1, 2
  • Pieter Ghysels 1, 3
  • Wim Heirman 1, 4
  • Wim Vanroose 3
  1. Intel/Flanders Exascience Lab, Leuven, Belgium
  2. Imec, Leuven, Belgium
  3. Universiteit Antwerpen, Antwerp, Belgium
  4. Universiteit Gent, Ghent, Belgium
