Toward a Theory of Algorithm-Architecture Co-design

  • Conference paper
High Performance Computing for Computational Science - VECPAR 2012 (VECPAR 2012)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7851)

Abstract

We are carrying out a research program that asks whether there is a useful mathematical framework for reasoning at a high level about the behavior of an algorithm on a supercomputer with respect to the physical constraints of energy, power, and die area. By “high-level,” we mean that we wish to explicitly relate characteristics of an algorithm, such as its inherent parallelism or its memory and communication behavior, to parameters of an architecture, such as the number of cores, the structure of the memory hierarchy, or the network topology. Our ultimate goal is to say, in broad but also quantitative terms, how macroscopic changes to an architecture might affect the execution time, scalability, accuracy, and power efficiency of a computation; and, conversely, to identify which classes of computation might best match a given architecture. The approach we shall outline marries abstract algorithmic complexity analysis with caps on power and die area, which are arguably the central first-order constraints on the extreme-scale systems of 2018 and beyond [1, 16, 21, 29, 41]. We refer to our approach as one of algorithm-architecture co-design.

References

  1. The Potential Impact of High-End Capability Computing on Four Illustrative Fields of Science and Engineering. The National Academies Press, Washington, DC (2008)

  2. Arge, L., Goodrich, M.T., Nelson, M., Sitchinava, N.: Fundamental parallel algorithms for private-cache chip multiprocessors. In: Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures, SPAA 2008, p. 197. ACM Press, New York (2008)

  3. Badia, R.M., Rodriguez, G., Labarta, J.: Deriving analytical models from a limited number of runs. In: Proceedings of Parallel Computing, ParCo, Minisymposium on Performance Analysis, pp. 1–6 (2003)

  4. Barker, K., Benner, A., Hoare, R., Hoisie, A., Jones, A., Kerbyson, D., Li, D., Melhem, R., Rajamony, R., Schenfeld, E., Shao, S., Stunkel, C., Walker, P.: On the Feasibility of Optical Circuit Switching for High Performance Computing Systems. In: ACM/IEEE SC 2005 Conference, SC 2005. IEEE (2005), http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1559968&tag=1

  5. Barker, K.J., Hoisie, A., Kerbyson, D.J.: An early performance analysis of POWER7-IH HPC systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on SC 2011, p. 1. ACM Press, New York (2011)

  6. Blelloch, G.E.: Programming parallel algorithms. Communications of the ACM 39(3), 85–97 (1996)

  7. Blelloch, G.E., Gibbons, P.B., Simhadri, H.V.: Low depth cache-oblivious algorithms. In: Proc. ACM Symp. Parallel Algorithms and Architectures, SPAA, Santorini, Greece (June 2010)

  8. Carrington, L., Snavely, A., Wolter, N.: A performance prediction framework for scientific applications. Future Generation Computer Systems 22(3), 336–346 (2006)

  9. Casas, M., Badia, R.M., Labarta, J.: Prediction of behavior of MPI applications. In: 2008 IEEE International Conference on Cluster Computing, pp. 242–251. IEEE (September 2008)

  10. Chandramowlishwaran, A., Choi, J.W., Madduri, K., Vuduc, R.: Towards a communication optimal fast multipole method and its implications for exascale. In: Proc. ACM Symp. Parallel Algorithms and Architectures, pp. 182–184. ACM, New York (2012), http://dl.acm.org/citation.cfm?id=2312039

  11. Chowdhury, R.A., Silvestri, F., Blakeley, B., Ramachandran, V.: Oblivious algorithms for multicores and network of processors. In: 2010 IEEE International Symposium on Parallel & Distributed Processing, IPDPS, pp. 1–12. IEEE (2010)

  12. Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K.E., Santos, E., Subramonian, R., von Eicken, T.: LogP: Towards a realistic model of parallel computation. ACM SIGPLAN Notices 28(7), 1–12 (1993)

  13. Czechowski, K., McClanahan, C., Battaglino, C., Iyer, K., Yeung, P.-K., Vuduc, R.: On the communication complexity of 3D FFTs and its implications for exascale. In: Proc. ACM Int’l. Conf. Supercomputing, ICS, San Servolo Island, Venice, Italy (June 2012) (to appear)

  14. Demmel, J.W.: Applied Numerical Linear Algebra. SIAM (1997)

  15. Desprez, F., Markomanolis, G.S., Quinson, M., Suter, F.: Assessing the Performance of MPI Applications through Time-Independent Trace Replay. In: 2011 40th International Conference on Parallel Processing Workshops, pp. 467–476. IEEE (September 2011)

  16. Dongarra, J., Beckman, P., Aerts, P., Cappello, F., Lippert, T., Matsuoka, S., Messina, P., Moore, T., Stevens, R., Trefethen, A., Valero, M.: The International Exascale Software Project: A call to cooperative action by the global high performance community. In: Int’l. J. High-Performance Computing Applications, IJHPCA, vol. 23(4), pp. 309–322 (2009), http://hpc.sagepub.com/content/23/4/309

  17. Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. In: Proc. Symp. Foundations of Computer Science, FOCS, New York, NY, USA, pp. 285–297 (October 1999)

  18. Ghosh, S., Martonosi, M., Malik, S.: Cache miss equations: A compiler framework for analyzing and tuning memory behavior. ACM Trans. Programming Languages and Systems (TOPLAS) 21(4), 703–746 (1999)

  19. Gonzalez, J., Gimenez, J., Casas, M., Moreto, M., Ramirez, A., Labarta, J., Valero, M.: Simulating Whole Supercomputer Applications. IEEE Micro 31(3), 32–45 (2011)

  20. Guz, Z., Bolotin, E., Keidar, I., Kolodny, A., Mendelson, A., Weiser, U.: Many-Core vs. Many-Thread Machines: Stay Away From the Valley. IEEE Computer Architecture Letters 8(1), 25–28 (2009)

  21. Hemmert, K.S., Vetter, J.S., Bergman, K., Das, C., Emami, A., Janssen, C., Panda, D.K., Stunkel, C., Underwood, K., Yalamanchili, S.: IAA Interconnection Networks Workshop 2008. Technical Report FTGTR-2009-03, Future Technologies Group, Oak Ridge National Laboratory (April 2009), http://ft.ornl.gov/pubs-archive/iaa-ic-2008-workshop-report-final.pdf

  22. Hill, M.D., Marty, M.R.: Amdahl’s Law in the multicore era. IEEE Computer 41(7), 33–38 (2008)

  23. Hoefler, T., Schneider, T., Lumsdaine, A.: LogGOPSim: Simulating large-scale applications in the LogGOPS model. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC 2010, p. 597. ACM Press, New York (2010)

  24. Hoisie, A., Johnson, G., Kerbyson, D.J., Lang, M., Pakin, S.: A performance comparison through benchmarking and modeling of three leading supercomputers: Blue Gene/L, Red Storm, and Purple. In: Proc. ACM/IEEE Conf. Supercomputing, SC, number 74, Tampa, FL, USA (November 2006)

  25. Jagode, H., Knupfer, A., Dongarra, J., Jurenz, M., Mueller, M.S., Nagel, W.E.: Trace-based performance analysis for the petascale simulation code FLASH. International Journal of High Performance Computing Applications (December 2010)

  26. Kerbyson, D.J., Alme, H.J., Hoisie, A., Petrini, F., Wasserman, H.J., Gittings, M.: Predictive performance and scalability modeling of a large-scale application. In: Proceedings of the 2001 ACM/IEEE Conference on Supercomputing (CDROM) - Supercomputing 2001, p. 37. ACM Press, New York (2001)

  27. Kerbyson, D.J., Hoisie, A., Wasserman, H.: Modelling the performance of large-scale systems. In: IEE Proceedings–Software, vol. 150, pp. 214–221 (August 2003)

  28. Kerbyson, D.J., Jones, P.W.: A Performance Model of the Parallel Ocean Program. International Journal of High Performance Computing Applications 19(3), 261–276 (2005)

  29. Kogge, P., Bergman, K., Borkar, S., Campbell, D., Carlson, W., Dally, W., Denneau, M., Franzon, P., Harrod, W., Hill, K., Hiller, J., Karp, S., Keckler, S., Klein, D., Lucas, R., Richards, M., Scarpelli, A., Scott, S., Snavely, A., Sterling, T., Williams, R.S., Yelick, K.: Exascale Computing Study: Technology challenges in achieving exascale systems (September 2008), http://users.ece.gatech.edu/~mrichard/ExascaleComputingStudyReports/ECS_reports.htm

  30. Kung, H.: Let’s design algorithms for VLSI systems. In: Proceedings of the Caltech Conference on VLSI: Architecture, Design, and Fabrication, pp. 65–90 (1979)

  31. Lengauer, T.: VLSI theory. In: Handbook of Theoretical Computer Science, ch. 16, pp. 837–865. Elsevier Science Publishers B.V. (1990)

  32. Lively, C.W., Taylor, V.E., Alam, S.R., Vetter, J.S.: A methodology for developing high fidelity communication models for large-scale applications targeted on multicore systems. In: Proc. Int’l. Symp. Computer Architecture and High Performance Computing, SBAC-PAD, Mato Grosso do Sul, Brazil, pp. 55–62 (October 2008)

  33. Mandel, J., Parter, S.V.: On the multigrid F-cycle. Applied Mathematics and Computation 37(1), 19–36 (1990)

  34. Numrich, R.W.: Computational force: A unifying concept for scalability analysis. In: Advances in Parallel Computing, vol. 15. IOS Press (2008)

  35. Numrich, R.W.: A metric space for computer programs and the Principle of Computational Least Action. J. Supercomputing 43(3), 281–298 (2008)

  36. Numrich, R.W., Heroux, M.A.: Self-similarity of parallel machines. Parallel Computing 37(2), 69–84 (2011)

  37. Rodrigues, A.F., et al.: The structural simulation toolkit. ACM SIGMETRICS Performance Evaluation Review 38(4), 37 (2011)

  38. Rosenberg, A.L.: Three-Dimensional VLSI: a case study. Journal of the ACM 30(3), 397–416 (1983)

  39. Rosenfeld, P., Cooper-Balis, E., Jacob, B.: DRAMSim2: A Cycle Accurate Memory System Simulator. IEEE Computer Architecture Letters 10(1), 16–19 (2011)

  40. Savage, J.E.: Models of Computation: Exploring the power of computing. CC-3.0, BY-NC-ND, electronic edition (2008)

  41. Simon, H., Zacharia, T., Stevens, R.: Modeling and simulation at the exascale for energy and the environment. Technical report, Office of Science, U.S. Dept. of Energy (May 2008), http://www.sc.doe.gov/ascr/ProgramDocuments/Docs/TownHall.pdf

  42. Snavely, A., Wolter, N., Carrington, L.: Modeling application performance by convolving machine signatures with application profiles. In: Proceedings of the Fourth Annual IEEE International Workshop on Workload Characterization, WWC-4 (Cat. No.01EX538), pp. 149–156. IEEE

  43. Thompson, C.D.: Area-time complexity for VLSI. In: Proceedings of the Eleventh Annual ACM Symposium on Theory of Computing, STOC 1979, pp. 81–88. ACM Press, New York (1979)

  44. Toledo, S.: Locality of reference in LU decomposition with partial pivoting. SIAM J. Matrix Anal. Appl. 18(4), 1065–1081 (1997)

  45. Valiant, L.G.: A bridging model for parallel computation. Communications of the ACM 33(8), 103–111 (1990)

  46. Valiant, L.G.: A bridging model for multi-core computing. In: Halperin, D., Mehlhorn, K. (eds.) ESA 2008. LNCS, vol. 5193, pp. 13–28. Springer, Heidelberg (2008)

  47. van Gemund, A.J.: Symbolic performance modeling of parallel systems. IEEE Transactions on Parallel and Distributed Systems 54(7), 922–927 (2005)

  48. Wickremesinghe, R., Arge, L., Chase, J.S., Vitter, J.S.: Efficient sorting using registers and caches. J. Experimental Algorithmics (JEA) 7, 9 (2002)

  49. Woo, D.H., Lee, H.-H.S.: Extending Amdahl’s Law for energy-efficient computing in the many-core era. IEEE Computer 41(12), 24–31 (2008)


Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Vuduc, R., Czechowski, K. (2013). Toward a Theory of Algorithm-Architecture Co-design. In: Daydé, M., Marques, O., Nakajima, K. (eds) High Performance Computing for Computational Science - VECPAR 2012. VECPAR 2012. Lecture Notes in Computer Science, vol 7851. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38718-0_2

  • DOI: https://doi.org/10.1007/978-3-642-38718-0_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38717-3

  • Online ISBN: 978-3-642-38718-0

  • eBook Packages: Computer Science, Computer Science (R0)
