Abstract
The current trend toward multicore and Symmetric Multi-Processor (SMP) architectures underscores the need for parallelism in most scientific computations. Matrix-matrix multiplication is one of the fundamental computations in many algorithms for scientific and numerical analysis. Although a number of algorithms (such as Cannon's algorithm, PUMMA, and SUMMA) have been proposed for matrix-matrix multiplication on distributed memory architectures, matrix-matrix algorithms for multicore and SMP architectures have not been extensively studied. We present two types of algorithms, based largely on blocked dense matrices, for parallel matrix-matrix multiplication on shared memory systems. The first algorithm is based on blocked matrices, while the second uses blocked matrices with the MapReduce framework in shared memory. Our experimental results show that our blocked dense matrix approach outperforms known existing implementations by up to 50%, while our MapReduce blocked matrix-matrix algorithm outperforms the existing matrix-matrix multiplication algorithm of the Phoenix shared memory MapReduce framework by about 40%.
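The core idea behind the blocked (tiled) approach described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, block size, and use of Python threads are illustrative assumptions. The key property it demonstrates is that each output tile of C is independent of the others, so tiles can be assigned to different workers on a shared memory system without synchronization on the output.

```python
# Hedged sketch of parallel blocked (tiled) matrix-matrix multiplication.
# Names (blocked_matmul, bs) and the thread-pool scheme are illustrative,
# not taken from the paper.
from concurrent.futures import ThreadPoolExecutor

def blocked_matmul(A, B, n, bs=2):
    """Multiply two n x n matrices (lists of lists), tile by tile."""
    C = [[0.0] * n for _ in range(n)]

    def compute_tile(ib, jb):
        # Accumulate the output tile C[ib:ib+bs, jb:jb+bs] over all k-tiles.
        for kb in range(0, n, bs):
            for i in range(ib, min(ib + bs, n)):
                for k in range(kb, min(kb + bs, n)):
                    a = A[i][k]
                    for j in range(jb, min(jb + bs, n)):
                        C[i][j] += a * B[k][j]

    # Output tiles are disjoint, so threads write to C without contention;
    # tiling also keeps each tile's working set small for cache reuse.
    tiles = [(ib, jb) for ib in range(0, n, bs) for jb in range(0, n, bs)]
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda t: compute_tile(*t), tiles))
    return C
```

In a compiled shared-memory implementation the same tile decomposition would typically be parallelized with OpenMP or a task scheduler rather than Python threads; the decomposition, not the threading mechanism, is the point of the sketch.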
References
Cannon, L.E.: A cellular computer to implement the Kalman filter algorithm. PhD thesis, Montana State University (1969)
Choi, J., Dongarra, J., Walker, D.: PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers. Concurrency: Practice and Experience 6(7), 543–570 (1994)
van de Geijn, R.A., Watts, J.: SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience 9(4), 255–274 (1997)
Krishnan, M., Nieplocha, J.: SRUMMA: A matrix multiplication algorithm suitable for clusters and scalable shared memory systems. In: Proceedings of the Parallel and Distributed Processing Symposium (2004)
Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., Kozyrakis, C.: Evaluating MapReduce for multi-core and multiprocessor systems. In: Proc. of the 13th Int'l Symposium on High Performance Computer Architecture, pp. 13–24 (2007)
Yoo, R.M., Romano, A., Kozyrakis, C.: Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system. In: Proc. of the 2009 IEEE Int'l Symposium on Workload Characterization, pp. 198–207 (2009)
Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: Proceedings of the 6th Symp. on Operating Systems Design and Implementation (2004)
Blackford, L., Choi, J., Cleary, A., D'Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.: ScaLAPACK Users' Guide. SIAM, Philadelphia (1997)
Anderson, E., Bai, Z., Bischof, C., Blackford, L., Demmel, J., Dongarra, J., Hammarling, S., Du Croz, J., Greenbaum, A., McKenney, A., Sorensen, D.: LAPACK Users' Guide. SIAM, Philadelphia (1992)
Kurzak, J., Ltaief, H., Dongarra, J., Badia, R.: Scheduling dense linear algebra operations on multicore processors. Concurrency and Computation: Practice and Experience 22(1) (2010)
Bentz, J.L., Kendall, R.A.: Parallelization of General Matrix Multiply Routines Using OpenMP. In: Chapman, B.M. (ed.) WOMPAT 2004. LNCS, vol. 3349, pp. 1–11. Springer, Heidelberg (2005)
Hackenberg, D., Schöne, R., Nagel, W.E., Pflüger, S.: Optimizing OpenMP Parallelized DGEMM Calls on SGI Altix 3700. In: Nagel, W.E., Walter, W.V., Lehner, W. (eds.) Euro-Par 2006. LNCS, vol. 4128, pp. 145–154. Springer, Heidelberg (2006)
Alpatov, P., Baker, G., Edwards, C., Gunnels, J., Morrow, G., Overfelt, J., van de Geijn, R., Wu, J.: PLAPACK: Parallel linear algebra package. In: Proceedings of the SIAM Parallel Processing Conference (1997)
Strassen, V.: Gaussian elimination is not optimal. Numerische Mathematik 14(3), 354–356 (1969)
Coppersmith, D., Winograd, S.: Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation 9, 251–280 (1990)
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Nimako, G., Otoo, E.J., Ohene-Kwofie, D. (2012). Fast Parallel Algorithms for Blocked Dense Matrix Multiplication on Shared Memory Architectures. In: Xiang, Y., Stojmenovic, I., Apduhan, B.O., Wang, G., Nakano, K., Zomaya, A. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2012. Lecture Notes in Computer Science, vol 7439. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33078-0_32
Print ISBN: 978-3-642-33077-3
Online ISBN: 978-3-642-33078-0