Abstract
The current trend toward multicore and Symmetric Multi-Processor (SMP) architectures underscores the need for parallelism in most scientific computations. Matrix-matrix multiplication is one of the fundamental computations in many algorithms for scientific and numerical analysis. Although a number of algorithms (such as Cannon's algorithm, PUMMA, and SUMMA) have been proposed for matrix-matrix multiplication on distributed memory architectures, matrix-matrix algorithms for multicore and SMP architectures have not been extensively studied. We present two types of algorithms, based largely on blocked dense matrices, for parallel matrix-matrix multiplication on shared memory systems. The first algorithm is based on blocked matrices, while the second uses blocked matrices with the MapReduce framework in shared memory. Our experimental results show that our blocked dense matrix approach outperforms known existing implementations by up to 50%, while our MapReduce blocked matrix-matrix algorithm outperforms the existing matrix-matrix multiplication algorithm of the Phoenix shared memory MapReduce framework by about 40%.
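The core idea behind the blocked (tiled) approach described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, block size, and use of Python threads are illustrative assumptions. The key property it demonstrates is that each output tile of C is independent of the others, so tiles can be assigned to different workers on a shared memory system without synchronization on the output.

```python
# Hedged sketch of parallel blocked (tiled) matrix-matrix multiplication.
# Names (blocked_matmul, bs) and the thread-pool scheme are illustrative,
# not taken from the paper.
from concurrent.futures import ThreadPoolExecutor

def blocked_matmul(A, B, n, bs=2):
    """Multiply two n x n matrices (lists of lists), tile by tile."""
    C = [[0.0] * n for _ in range(n)]

    def compute_tile(ib, jb):
        # Accumulate the output tile C[ib:ib+bs, jb:jb+bs] over all k-tiles.
        for kb in range(0, n, bs):
            for i in range(ib, min(ib + bs, n)):
                for k in range(kb, min(kb + bs, n)):
                    a = A[i][k]
                    for j in range(jb, min(jb + bs, n)):
                        C[i][j] += a * B[k][j]

    # Output tiles are disjoint, so threads write to C without contention;
    # tiling also keeps each tile's working set small for cache reuse.
    tiles = [(ib, jb) for ib in range(0, n, bs) for jb in range(0, n, bs)]
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda t: compute_tile(*t), tiles))
    return C
```

In a compiled shared-memory implementation the same tile decomposition would typically be parallelized with OpenMP or a task scheduler rather than Python threads; the decomposition, not the threading mechanism, is the point of the sketch.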
References
Cannon, L.E.: A cellular computer to implement the Kalman filter algorithm. PhD thesis, Montana State University (1969)
Choi, J., Dongarra, J., Walker, D.: PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers. Concurrency: Practice and Experience 6(7), 543–570 (1994)
van de Geijn, R.A., Watts, J.: SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience 9(4), 255–274 (1997)
Krishnan, M., Nieplocha, J.: SRUMMA: A matrix multiplication algorithm suitable for clusters and scalable shared memory systems. In: Proceedings of the Parallel and Distributed Processing Symposium (2004)
Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., Kozyrakis, C.: Evaluating MapReduce for multi-core and multiprocessor systems. In: Proc. of the 13th Int'l Symposium on High Performance Computer Architecture, pp. 13–24 (2007)
Yoo, R.M., Romano, A., Kozyrakis, C.: Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system. In: Proc. of the 2009 IEEE Int'l Symposium on Workload Characterization, pp. 198–207 (2009)
Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: Proceedings of the 6th Symp. on Operating Systems Design and Implementation (2004)
Blackford, L., Choi, J., Cleary, A., D'Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.: ScaLAPACK Users' Guide. SIAM, Philadelphia (1997)
Anderson, E., Bai, Z., Bischof, C., Blackford, L., Demmel, J., Dongarra, J., Hammarling, S., Du Croz, J., Greenbaum, A., McKenney, A., Sorensen, D.: LAPACK Users' Guide. SIAM, Philadelphia (1992)
Kurzak, J., Ltaief, H., Dongarra, J., Badia, R.: Scheduling dense linear algebra operations on multicore processors. Concurrency and Computation: Practice and Experience 22(1) (2010)
Bentz, J.L., Kendall, R.A.: Parallelization of General Matrix Multiply Routines Using OpenMP. In: Chapman, B.M. (ed.) WOMPAT 2004. LNCS, vol. 3349, pp. 1–11. Springer, Heidelberg (2005)
Hackenberg, D., Schöne, R., Nagel, W.E., Pflüger, S.: Optimizing OpenMP Parallelized DGEMM Calls on SGI Altix 3700. In: Nagel, W.E., Walter, W.V., Lehner, W. (eds.) Euro-Par 2006. LNCS, vol. 4128, pp. 145–154. Springer, Heidelberg (2006)
Alpatov, P., Baker, G., Edwards, C., Gunnels, J., Morrow, G., Overfelt, J., van de Geijn, R., Wu, J.: PLAPACK: Parallel linear algebra package. In: Proceedings of the SIAM Parallel Processing Conference (1997)
Strassen, V.: Gaussian elimination is not optimal. Numerische Mathematik 14(3), 354–356 (1969)
Coppersmith, D., Winograd, S.: Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation 9, 251–280 (1990)
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Nimako, G., Otoo, E.J., Ohene-Kwofie, D. (2012). Fast Parallel Algorithms for Blocked Dense Matrix Multiplication on Shared Memory Architectures. In: Xiang, Y., Stojmenovic, I., Apduhan, B.O., Wang, G., Nakano, K., Zomaya, A. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2012. Lecture Notes in Computer Science, vol 7439. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33078-0_32
Print ISBN: 978-3-642-33077-3
Online ISBN: 978-3-642-33078-0