Massively parallel computing: Data distribution and communication

Johnsson, S. Lennart

doi:10.1007/3-540-56731-3_9

S. Lennart Johnsson^1,2

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 678)

Included in the following conference series:

Heinz Nixdorf Symposium at the University of Paderborn

Abstract

We discuss some techniques for preserving locality of reference in index spaces when mapped to memory units in a distributed memory architecture. In particular, we discuss the use of multidimensional address spaces instead of linearized address spaces, partitioning of irregular grids, and placement of partitions among nodes. We also discuss a set of communication primitives we have found very useful on the Connection Machine systems in implementing scientific and engineering applications. We briefly review some of the techniques used to fully utilize the bandwidth of the binary cube network of the CM-2 and CM-200, and give some performance data from implementations of communication primitives.
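
As a rough illustration of why a multidimensional address space can preserve locality better than a linearized one, the sketch below counts nearest-neighbor pairs of a square grid that end up on different nodes under two layouts. It is not taken from the chapter: the function names, the 16 x 16 grid, the 16-node machine, and the four-point stencil are all assumptions chosen to keep the example small.

```python
def owner_linearized(i, j, n, p):
    """Node owning (i, j) when the row-major address i*n + j is cut
    into p equal contiguous blocks (hypothetical layout)."""
    return (i * n + j) // (n * n // p)


def owner_multidim(i, j, n, q):
    """Node owning (i, j) when each axis is split into q blocks,
    giving a q x q arrangement of nodes (hypothetical layout)."""
    b = n // q
    return (i // b) * q + (j // b)


def off_node_neighbor_pairs(n, owner):
    """Count adjacent grid-point pairs (down and right neighbors)
    whose two points are assigned to different nodes."""
    count = 0
    for i in range(n):
        for j in range(n):
            if i + 1 < n and owner(i, j) != owner(i + 1, j):
                count += 1
            if j + 1 < n and owner(i, j) != owner(i, j + 1):
                count += 1
    return count


n, q = 16, 4          # assumed grid size and nodes per axis
p = q * q             # total number of nodes
linear = off_node_neighbor_pairs(n, lambda i, j: owner_linearized(i, j, n, p))
blocked = off_node_neighbor_pairs(n, lambda i, j: owner_multidim(i, j, n, q))
print(linear, blocked)
```

Under these assumptions the linearized layout gives each node one full grid row, so every vertical neighbor pair crosses a node boundary, while the two-dimensional block layout confines crossings to the block edges.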

Author information

Authors and Affiliations

Division of Applied Sciences, Harvard University, Cambridge, MA 02138
S. Lennart Johnsson
Thinking Machines Corp., USA
S. Lennart Johnsson

Editor information

F. Meyer, B. Monien, A. L. Rosenberg

Cite this paper

Johnsson, S.L. (1993). Massively parallel computing: Data distribution and communication. In: Meyer, F., Monien, B., Rosenberg, A.L. (eds) Parallel Architectures and Their Efficient Use. Nixdorf 1992. Lecture Notes in Computer Science, vol 678. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-56731-3_9

DOI: https://doi.org/10.1007/3-540-56731-3_9
Published: 28 May 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-56731-8
Online ISBN: 978-3-540-47637-5
eBook Packages: Springer Book Archive
