Over the past five years, almost all computer manufacturers have dramatically changed their computer architectures to Multicore (MC) processors. We briefly describe Cache Blocking as it relates to computer architectures since about 1985, covering the where, when, how, and why of Cache Blocking for dense linear algebra. It will be seen that the arrangement in memory of the submatrices A_ij of A that are being processed is very important.
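To illustrate the last point, the following is a minimal sketch, not the author's code, of one common way to arrange submatrices in memory: each nb x nb block A_ij is copied into a contiguous run of storage, and a blocked matrix multiply then works on one tile of each operand at a time. The function names `to_block_layout` and `blocked_matmul`, the row-major input, the ordering of the blocks, and the requirement that nb divides n are all assumptions made for this example.

```python
def to_block_layout(a, n, nb):
    """Copy an n x n row-major matrix (flat list) into square-block layout.

    Each nb x nb submatrix A_ij becomes one contiguous run of nb*nb
    elements; blocks are ordered by block row, then block column.
    Assumption for this sketch: nb divides n exactly.
    """
    assert n % nb == 0
    m = n // nb  # number of blocks per dimension
    out = []
    for bi in range(m):
        for bj in range(m):
            for r in range(nb):
                start = (bi * nb + r) * n + bj * nb
                out.extend(a[start:start + nb])
    return out


def blocked_matmul(a_blk, b_blk, n, nb):
    """Return C = A * B, with A, B, and C all held in square-block layout."""
    m = n // nb
    sz = nb * nb  # elements per block
    c_blk = [0.0] * (n * n)
    for bi in range(m):
        for bj in range(m):
            c0 = (bi * m + bj) * sz
            for bk in range(m):
                a0 = (bi * m + bk) * sz
                b0 = (bk * m + bj) * sz
                # Kernel: multiply two contiguous nb x nb tiles and
                # accumulate into the contiguous tile C_ij.
                for i in range(nb):
                    for k in range(nb):
                        aik = a_blk[a0 + i * nb + k]
                        for j in range(nb):
                            c_blk[c0 + i * nb + j] += aik * b_blk[b0 + k * nb + j]
    return c_blk
```

Because every tile touched by the inner kernel occupies one contiguous stretch of memory, a block size nb chosen so that three tiles fit in cache lets the kernel run without striding across the full matrix. In this sketch the block size is simply a parameter.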


Keywords: Cache Blocking, Matrix Transposition, Cell Processor, Array Space, Dense Linear Algebra





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Fred G. Gustavson (1, 2)
  1. IBM T.J. Watson Research Center, Emeritus, Poland
  2. Umeå University, Sweden
