How to Write Fast Numerical Code: A Small Introduction

  • Srinivas Chellappa
  • Franz Franchetti
  • Markus Püschel
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5235)


Abstract

The complexity of modern computing platforms has made it extremely difficult to write numerical code that achieves the best possible performance. Straightforward implementations based on algorithms that minimize the operation count often fall short in performance by at least one order of magnitude. This tutorial introduces the reader to a set of general techniques to improve the performance of numerical code, focusing on optimizations for the computer’s memory hierarchy. Further, program generators are discussed as a way to reduce the implementation and optimization effort. Two running examples are used to demonstrate these techniques: matrix-matrix multiplication and the discrete Fourier transform.


Keywords: Discrete Fourier Transform · Cache Line · Memory Hierarchy · Operation Count · Vector Instruction
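The memory-hierarchy optimization the abstract refers to can be illustrated with loop blocking (tiling) for matrix-matrix multiplication: instead of streaming entire rows and columns through the cache, the computation is reorganized to work on small submatrices that stay cache-resident while they are reused. The sketch below is illustrative only; the matrix size `N` and block size `NB` are assumed values, and real block sizes are chosen to match the target cache.

```c
#include <assert.h>
#include <stddef.h>

#define N  64   /* matrix dimension (assumed, for illustration) */
#define NB 8    /* block size; in practice tuned to the cache size */

/* Naive triple loop: C += A * B, row-major square matrices. */
static void mmm_naive(const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            for (size_t k = 0; k < N; k++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

/* Blocked (tiled) version: the three outer loops walk over NB x NB
   submatrices; the three inner loops multiply one pair of blocks.
   Each block is reused NB times while it is still in cache, which
   reduces memory traffic compared to the naive loop order.
   Assumes N is a multiple of NB. */
static void mmm_blocked(const double *A, const double *B, double *C)
{
    for (size_t i0 = 0; i0 < N; i0 += NB)
        for (size_t j0 = 0; j0 < N; j0 += NB)
            for (size_t k0 = 0; k0 < N; k0 += NB)
                for (size_t i = i0; i < i0 + NB; i++)
                    for (size_t j = j0; j < j0 + NB; j++)
                        for (size_t k = k0; k < k0 + NB; k++)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}
```

Both versions perform exactly the same floating-point operations in a different order, which is why blocking improves performance without changing the operation count, the central point of the tutorial.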





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Srinivas Chellappa¹
  • Franz Franchetti¹
  • Markus Püschel¹

  1. Electrical and Computer Engineering, Carnegie Mellon University, USA
