Massively parallel computing: Data distribution and communication

  • S. Lennart Johnsson
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 678)


We discuss some techniques for preserving locality of reference in index spaces when mapped to memory units in a distributed memory architecture. In particular, we discuss the use of multidimensional address spaces instead of linearized address spaces, partitioning of irregular grids, and placement of partitions among nodes. We also discuss a set of communication primitives we have found very useful on the Connection Machine systems in implementing scientific and engineering applications. We briefly review some of the techniques used to fully utilize the bandwidth of the binary cube network of the CM-2 and CM-200, and give some performance data from implementations of communication primitives.
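One common way to preserve locality when an index space is mapped onto a binary cube is to encode each grid axis with a binary-reflected Gray code, so that adjacent grid indices land on adjacent cube nodes. The sketch below illustrates that idea only; it is not taken from the paper, and the names `gray`, `hamming`, and `make_grid_to_cube` are hypothetical. It assumes a grid whose axis lengths are powers of two.

```python
def gray(i: int) -> int:
    """Binary-reflected Gray code of a non-negative integer."""
    return i ^ (i >> 1)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def make_grid_to_cube(row_bits: int, col_bits: int):
    """Return a function mapping (r, c) on a 2**row_bits x 2**col_bits grid
    to a node address in a binary cube of dimension row_bits + col_bits.

    Assumption for illustration: each axis is Gray coded independently and
    the two codes are concatenated to form the node address.
    """
    def node(r: int, c: int) -> int:
        return (gray(r) << col_bits) | gray(c)
    return node


if __name__ == "__main__":
    node = make_grid_to_cube(2, 3)  # 4 x 8 grid on a 5-dimensional cube
    for r in range(4):
        print([format(node(r, c), "05b") for c in range(8)])
```

With this encoding, grid neighbors along either axis (including the wraparound neighbors) differ in exactly one address bit, so each nearest-neighbor exchange crosses a single cube link.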


Keywords: Local Memory, Hamiltonian Path, Address Space, Gray Code, Index Space
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 1993

Authors and Affiliations

  • S. Lennart Johnsson (1, 2)
  1. Division of Applied Sciences, Harvard University, Cambridge
  2. Thinking Machines Corp., USA
