Abstract
Scientific programmers are accustomed to expressing in their programs the “who” (variable declarations) and the “what” (operations), in some sequentialized order, and leaving to the systems software and hardware the questions of “when” and “where”. This act of delegation is appropriate at small scales, since programmer management of pipelines, multiple functional units, and multilevel caches is presently beyond reward, and the depth and complexity of such performance-motivated architectural developments are sure to increase. However, disregard for the differential costs of accessing different locations in memory (the “flat memory” model) can put unnecessary amounts of synchronization and data motion on the critical path of program execution. Different organizations of algorithms leading to mathematically equivalent results can have very different levels of exposed synchronization and data motion, and algorithmicists of the future will have to be conscious of and adapt to the distributed and hierarchical aspects of memory architecture.
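As a rough illustration of the point about mathematically equivalent organizations, the sketch below sums the same row-major array in two traversal orders and counts misses in a toy cache. The cache model (fully associative, LRU, 4 lines of 8 elements), the array size, and all names are assumptions made for this example; none of them come from the chapter.

```python
from collections import OrderedDict


def count_line_misses(addresses, line_size=8, num_lines=4):
    """Count misses in a toy fully associative LRU cache (hypothetical model)."""
    cache = OrderedDict()
    misses = 0
    for addr in addresses:
        line = addr // line_size  # cache line holding this element
        if line in cache:
            cache.move_to_end(line)  # mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict least recently used line
    return misses


n = 32
# Row-major storage: element (i, j) lives at flat address i * n + j.
data = [float(k % 7) for k in range(n * n)]

row_order = [i * n + j for i in range(n) for j in range(n)]  # unit stride
col_order = [i * n + j for j in range(n) for i in range(n)]  # stride n

sum_row = sum(data[a] for a in row_order)
sum_col = sum(data[a] for a in col_order)
misses_row = count_line_misses(row_order)
misses_col = count_line_misses(col_order)

print(sum_row == sum_col, misses_row, misses_col)
```

Both traversals produce the same sum, but in this toy model the strided traversal misses on every access while the unit-stride traversal misses once per cache line. A "flat memory" cost model would rate the two loops as identical.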
Plenty of examples of architecturally motivated algorithmic adaptations can be given today; we illustrate herein with examples from recent aerodynamics simulations. For this purpose, pseudo-transient Newton-Krylov-Schwarz methods are briefly introduced and their parallel scalability in bulk synchronous SPMD applications is explored. We also indicate some fundamental limitations of bulk synchronous implicit solvers and propose asynchronous forms of nonlinear Schwarz methods as perhaps better adapted both to massively parallel architectures and strongly nonuniform applications. Suitably adapted PDE solvers seem to be readily extrapolated to the 100 Tflop/s capabilities envisioned in the coming decade. Making use of some novel quantitative metrics for the memory access efficiencies of high performance applications (“memtropy”) and for the local strength of nonlinearity (“tensoricity”) in applications with spatially nonuniform characteristics, we propose a migration path for scientific and engineering simulations towards the distributed and hierarchical Teraflops world, and we consider what simulations in this world will look like.
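The abstract names pseudo-transient Newton-Krylov-Schwarz methods without stating them. As a minimal sketch of the outermost layer only, the code below applies pseudo-transient continuation to a scalar equation F(u) = 0, taking one Newton step per pseudo-time step and growing the step by a residual-ratio rule (often called switched evolution/relaxation). The scalar setting, the exact division standing in for a Schwarz-preconditioned Krylov solve, the test function, and all names and parameter values are assumptions for illustration, not the chapter's formulation.

```python
def pseudo_transient_newton(F, dF, u0, dt0=0.1, tol=1e-10, max_steps=200):
    """Scalar pseudo-transient continuation (illustrative sketch).

    Each step solves (1/dt + F'(u)) * s = -F(u) and sets u <- u + s.
    In a Newton-Krylov-Schwarz code this solve would be a preconditioned
    Krylov iteration; here it is a single division.
    """
    u, dt = u0, dt0
    r = abs(F(u))
    for step in range(max_steps):
        if r < tol:
            return u, step
        s = -F(u) / (1.0 / dt + dF(u))  # damped Newton correction
        u += s
        r_new = abs(F(u))
        if r_new > 0.0:
            dt *= r / r_new  # grow dt as the residual falls
        r = r_new
    return u, max_steps


# Hypothetical test problem: u**3 + u - 2 = 0 has the real root u = 1.
root, steps = pseudo_transient_newton(
    lambda u: u**3 + u - 2.0,
    lambda u: 3.0 * u * u + 1.0,
    u0=0.0,
)
print(root, steps)
```

Early steps are small and behave like implicit time stepping; as the residual drops, the pseudo-time step grows and the iteration approaches an undamped Newton method.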
Copyright information
© 2000 Springer Science+Business Media Dordrecht
About this paper
Cite this paper
Keyes, D.E. (2000). Trends in Algorithms for Nonuniform Applications on Hierarchical Distributed Architectures. In: Salas, M.D., Anderson, W.K. (eds) Computational Aerosciences in the 21st Century. ICASE LaRC Interdisciplinary Series in Science and Engineering, vol 8. Springer, Dordrecht. https://doi.org/10.1007/978-94-010-0948-5_6
DOI: https://doi.org/10.1007/978-94-010-0948-5_6
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-010-3807-2
Online ISBN: 978-94-010-0948-5
eBook Packages: Springer Book Archive