Abstract
LU and QR factorizations are among the most widely used methods for solving dense linear systems of equations, and they have been extensively studied and implemented on vector and parallel computers. Since each parallel computer has a different ratio of computation to communication performance, the optimal computational block size, the one that yields an algorithm's maximum performance, differs from machine to machine. The data matrix must therefore be distributed with the machine-specific optimal block size before the computation begins. A block size that is too small or too large makes good performance on a machine nearly impossible, and obtaining better performance may then require a complete redistribution of the data matrix.
In this chapter, we present parallel LU and QR factorization algorithms that use an “algorithmic blocking” strategy on a 2-dimensional block-cyclic data distribution. With algorithmic blocking, the best performance can be obtained irrespective of the physical block size. The algorithms are implemented and compared with the ScaLAPACK factorization routines on the Intel Paragon computer.
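To make the 2-D block-cyclic distribution concrete, the following sketch (not taken from the chapter; function and variable names are illustrative assumptions) maps a global matrix entry onto a P x Q process grid with block size nb, which is the layout ScaLAPACK-style factorizations operate on:

```python
def block_cyclic_owner(i, j, nb, P, Q):
    """Return the process coordinates (p, q) and local indices (li, lj)
    of global entry (i, j) under a 2-D block-cyclic distribution with
    nb x nb blocks on a P x Q process grid."""
    # Which nb x nb block the entry falls in, globally.
    bi, bj = i // nb, j // nb
    # Block rows/columns are dealt out cyclically over the process grid.
    p, q = bi % P, bj % Q
    # Local block index on the owning process, plus the offset inside the block.
    li = (bi // P) * nb + (i % nb)
    lj = (bj // Q) * nb + (j % nb)
    return (p, q), (li, lj)

# Example: an 8 x 8 matrix with nb = 2 on a 2 x 2 process grid.
owner, local = block_cyclic_owner(5, 6, nb=2, P=2, Q=2)
```

Because the block size nb is baked into this mapping, changing it requires redistributing the matrix; the algorithmic-blocking approach of the chapter instead decouples the computational block size from this physical layout.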
Copyright information
© 1999 Springer Science+Business Media New York
Cite this chapter
Choi, J. (1999). Parallel Factorization Algorithms with Algorithmic Blocking. In: Yang, T. (ed.) Parallel Numerical Computation with Applications. The Springer International Series in Engineering and Computer Science, vol 515. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-5205-5_2
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-7371-1
Online ISBN: 978-1-4615-5205-5
eBook Packages: Springer Book Archive