Toward a Theory of Algorithm-Architecture Co-design

Vuduc, Richard; Czechowski, Kenneth

doi:10.1007/978-3-642-38718-0_2

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7851)

Included in the following conference series:

International Conference on High Performance Computing for Computational Science

Abstract

We are carrying out a research program that asks whether there is a useful mathematical framework for reasoning at a high level about the behavior of an algorithm on a supercomputer with respect to the physical constraints of energy, power, and die area. By “high-level,” we mean that we wish to explicitly relate characteristics of an algorithm, such as its inherent parallelism or memory and communication behavior, with parameters of an architecture, such as the number of cores, structure of the memory hierarchy, or network topology. Our ultimate goal is to say, in broad but also quantitative terms, how macroscopic changes to an architecture might affect the execution time, scalability, accuracy, and power-efficiency of a computation; and, conversely, to identify what classes of computation might best match a given architecture. The approach we shall outline marries abstract algorithmic complexity analysis with caps on power and die area, which are arguably the central first-order constraints on the extreme-scale systems of 2018 and beyond [1, 16, 21, 29, 41]. We refer to our approach as one of algorithm-architecture co-design.
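
As a hedged illustration of what relating an algorithmic characteristic to an architectural parameter under a die-area cap can look like, the sketch below uses the multicore form of Amdahl's Law from Hill and Marty, which appears in the reference list. It is not the framework proposed in this chapter. The function names (`perf`, `symmetric_speedup`, `best_core_size`) are hypothetical, and the square-root rule for single-core performance is a modeling assumption, not a measured fact.

```python
import math


def perf(r: float) -> float:
    # Assumption: a core built from r base-core equivalents (BCEs)
    # delivers sequential performance proportional to sqrt(r).
    return math.sqrt(r)


def symmetric_speedup(f: float, n: int, r: int) -> float:
    """Speedup of a symmetric multicore chip relative to one base core.

    f: fraction of the work that is parallelizable (algorithm side)
    n: die-area budget, in BCEs (physical cap)
    r: BCEs spent on each core, giving n / r cores (architecture side)
    """
    serial_time = (1.0 - f) / perf(r)
    parallel_time = f * r / (perf(r) * n)
    return 1.0 / (serial_time + parallel_time)


def best_core_size(f: float, n: int) -> int:
    # Search only core sizes that divide the area budget evenly.
    candidates = [r for r in range(1, n + 1) if n % r == 0]
    return max(candidates, key=lambda r: symmetric_speedup(f, n, r))


if __name__ == "__main__":
    for f in (0.5, 0.9, 0.99, 1.0):
        r = best_core_size(f, 16)
        print(f"f={f}: best r={r}, speedup={symmetric_speedup(f, 16, r):.2f}")
```

Under these assumptions, a fixed area budget favors one large core when half the work is serial and many small cores when the work is fully parallel, which is the kind of algorithm-to-architecture matching the abstract describes in general terms.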

References

The Potential Impact of High-End Capability Computing on Four Illustrative Fields of Science and Engineering. The National Academies Press, Washington, DC (2008)
Arge, L., Goodrich, M.T., Nelson, M., Sitchinava, N.: Fundamental parallel algorithms for private-cache chip multiprocessors. In: Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures, SPAA 2008, p. 197. ACM Press, New York (2008)
Badia, R.M., Rodriguez, G., Labarta, J.: Deriving analytical models from a limited number of runs. In: Proceedings of Parallel Computing, ParCo, Minisymposium on Performance Analysis, pp. 1–6 (2003)
Barker, K., Benner, A., Hoare, R., Hoisie, A., Jones, A., Kerbyson, D., Li, D., Melhem, R., Rajamony, R., Schenfeld, E., Shao, S., Stunkel, C., Walker, P.: On the Feasibility of Optical Circuit Switching for High Performance Computing Systems. In: ACM/IEEE SC 2005 Conference, SC 2005. IEEE (2005), http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1559968&tag=1
Barker, K.J., Hoisie, A., Kerbyson, D.J.: An early performance analysis of POWER7-IH HPC systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on SC 2011, p. 1. ACM Press, New York (2011)
Blelloch, G.E.: Programming parallel algorithms. Communications of the ACM 39(3), 85–97 (1996)
Blelloch, G.E., Gibbons, P.B., Simhadri, H.V.: Low depth cache-oblivious algorithms. In: Proc. ACM Symp. Parallel Algorithms and Architectures, SPAA, Santorini, Greece (June 2010)
Carrington, L., Snavely, A., Wolter, N.: A performance prediction framework for scientific applications. Future Generation Computer Systems 22(3), 336–346 (2006)
Casas, M., Badia, R.M., Labarta, J.: Prediction of behavior of MPI applications. In: 2008 IEEE International Conference on Cluster Computing, pp. 242–251. IEEE (September 2008)
Chandramowlishwaran, A., Choi, J.W., Madduri, K., Vuduc, R.: Towards a communication optimal fast multipole method and its implications for exascale. In: Proc. ACM Symp. Parallel Algorithms and Architectures, pp. 182–184. ACM, New York (2012), http://dl.acm.org/citation.cfm?id=2312039
Chowdhury, R.A., Silvestri, F., Blakeley, B., Ramachandran, V.: Oblivious algorithms for multicores and network of processors. In: 2010 IEEE International Symposium on Parallel & Distributed Processing, IPDPS, pp. 1–12. IEEE (2010)
Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K.E., Santos, E., Subramonian, R., von Eicken, T.: LogP: Towards a realistic model of parallel computation. ACM SIGPLAN Notices 28(7), 1–12 (1993)
Czechowski, K., McClanahan, C., Battaglino, C., Iyer, K., Yeung, P.-K., Vuduc, R.: On the communication complexity of 3D FFTs and its implications for exascale. In: Proc. ACM Int’l. Conf. Supercomputing, ICS, San Servolo Island, Venice, Italy (June 2012) (to appear)
Demmel, J.W.: Applied Numerical Linear Algebra. SIAM (1997)
Desprez, F., Markomanolis, G.S., Quinson, M., Suter, F.: Assessing the Performance of MPI Applications through Time-Independent Trace Replay. In: 2011 40th International Conference on Parallel Processing Workshops, pp. 467–476. IEEE (September 2011)
Dongarra, J., Beckman, P., Aerts, P., Cappello, F., Lippert, T., Matsuoka, S., Messina, P., Moore, T., Stevens, R., Trefethen, A., Valero, M.: The International Exascale Software Project: A call to cooperative action by the global high performance community. In: Int’l. J. High-Performance Computing Applications, IJHPCA, vol. 23(4), pp. 309–322 (2009), http://hpc.sagepub.com/content/23/4/309
Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. In: Proc. Symp. Foundations of Computer Science, FOCS, New York, NY, USA, pp. 285–297 (October 1999)
Ghosh, S., Martonosi, M., Malik, S.: Cache miss equations: A compiler framework for analyzing and tuning memory behavior. ACM Trans. Programming Languages and Systems (TOPLAS) 21(4), 703–746 (1999)
Gonzalez, J., Gimenez, J., Casas, M., Moreto, M., Ramirez, A., Labarta, J., Valero, M.: Simulating Whole Supercomputer Applications. IEEE Micro 31(3), 32–45 (2011)
Guz, Z., Bolotin, E., Keidar, I., Kolodny, A., Mendelson, A., Weiser, U.: Many-Core vs. Many-Thread Machines: Stay Away From the Valley. IEEE Computer Architecture Letters 8(1), 25–28 (2009)
Hemmert, K.S., Vetter, J.S., Bergman, K., Das, C., Emami, A., Janssen, C., Panda, D.K., Stunkel, C., Underwood, K., Yalamanchili, S.: IAA Interconnection Networks Workshop 2008. Technical Report FTGTR-2009-03, Future Technologies Group, Oak Ridge National Laboratory (April 2009), http://ft.ornl.gov/pubs-archive/iaa-ic-2008-workshop-report-final.pdf
Hill, M.D., Marty, M.R.: Amdahl’s Law in the multicore era. IEEE Computer 41(7), 33–38 (2008)
Hoefler, T., Schneider, T., Lumsdaine, A.: LogGOPSim: Simulating large-scale applications in the LogGOPS model. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC 2010, p. 597. ACM Press, New York (2010)
Hoisie, A., Johnson, G., Kerbyson, D.J., Lang, M., Pakin, S.: A performance comparison through benchmarking and modeling of three leading supercomputers: Blue Gene/L, Red Storm, and Purple. In: Proc. ACM/IEEE Conf. Supercomputing, SC, number 74, Tampa, FL, USA (November 2006)
Jagode, H., Knupfer, A., Dongarra, J., Jurenz, M., Mueller, M.S., Nagel, W.E.: Trace-based performance analysis for the petascale simulation code FLASH. International Journal of High Performance Computing Applications (December 2010)
Kerbyson, D.J., Alme, H.J., Hoisie, A., Petrini, F., Wasserman, H.J., Gittings, M.: Predictive performance and scalability modeling of a large-scale application. In: Proceedings of the 2001 ACM/IEEE Conference on Supercomputing (CDROM) - Supercomputing 2001, p. 37. ACM Press, New York (2001)
Kerbyson, D.J., Hoisie, A., Wasserman, H.: Modelling the performance of large-scale systems. In: IEE Proceedings–Software, vol. 150, pp. 214–221 (August 2003)
Kerbyson, D.J., Jones, P.W.: A Performance Model of the Parallel Ocean Program. International Journal of High Performance Computing Applications 19(3), 261–276 (2005)
Kogge, P., Bergman, K., Borkar, S., Campbell, D., Carlson, W., Dally, W., Denneau, M., Franzon, P., Harrod, W., Hill, K., Hiller, J., Karp, S., Keckler, S., Klein, D., Lucas, R., Richards, M., Scarpelli, A., Scott, S., Snavely, A., Sterling, T., Williams, R.S., Yelick, K.: Exascale Computing Study: Technology challenges in achieving exascale systems (September 2008), http://users.ece.gatech.edu/~mrichard/ExascaleComputingStudyReports/ECS_reports.htm
Kung, H.: Let’s design algorithms for VLSI systems. In: Proceedings of the Caltech Conference on VLSI: Architecture, Design, and Fabrication, pp. 65–90 (1979)
Lengauer, T.: VLSI theory. In: Handbook of Theoretical Computer Science, ch. 16, pp. 837–865. Elsevier Science Publishers B.V. (1990)
Lively, C.W., Taylor, V.E., Alam, S.R., Vetter, J.S.: A methodology for developing high fidelity communication models for large-scale applications targeted on multicore systems. In: Proc. Int’l. Symp. Computer Architecture and High Performance Computing, SBAC-PAD, Mato Grosso do Sul, Brazil, pp. 55–62 (October 2008)
Mandel, J., Parter, S.V.: On the multigrid F-cycle. Applied Mathematics and Computation 37(1), 19–36 (1990)
Numrich, R.W.: Computational force: A unifying concept for scalability analysis. In: Advances in Parallel Computing, vol. 15. IOS Press (2008)
Numrich, R.W.: A metric space for computer programs and the Principle of Computational Least Action. J. Supercomputing 43(3), 281–298 (2008)
Numrich, R.W., Heroux, M.A.: Self-similarity of parallel machines. Parallel Computing 37(2), 69–84 (2011)
Rodrigues, A.F., et al.: The structural simulation toolkit. ACM SIGMETRICS Performance Evaluation Review 38(4), 37 (2011)
Rosenberg, A.L.: Three-Dimensional VLSI: a case study. Journal of the ACM 30(3), 397–416 (1983)
Rosenfeld, P., Cooper-Balis, E., Jacob, B.: DRAMSim2: A Cycle Accurate Memory System Simulator. IEEE Computer Architecture Letters 10(1), 16–19 (2011)
Savage, J.E.: Models of Computation: Exploring the power of computing. CC-3.0, BY-NC-ND, electronic edition (2008)
Simon, H., Zacharia, T., Stevens, R.: Modeling and simulation at the exascale for energy and the environment. Technical report, Office of Science, U.S. Dept. of Energy (May 2008), http://www.sc.doe.gov/ascr/ProgramDocuments/Docs/TownHall.pdf
Snavely, A., Wolter, N., Carrington, L.: Modeling application performance by convolving machine signatures with application profiles. In: Proceedings of the Fourth Annual IEEE International Workshop on Workload Characterization, WWC-4 (Cat. No.01EX538), pp. 149–156. IEEE
Thompson, C.D.: Area-time complexity for VLSI. In: Proceedings of the Eleventh Annual ACM Symposium on Theory of Computing, STOC 1979, pp. 81–88. ACM Press, New York (1979)
Toledo, S.: Locality of reference in LU decomposition with partial pivoting. SIAM J. Matrix Anal. Appl. 18(4), 1065–1081 (1997)
Valiant, L.G.: A bridging model for parallel computation. Communications of the ACM 33(8), 103–111 (1990)
Valiant, L.G.: A bridging model for multi-core computing. In: Halperin, D., Mehlhorn, K. (eds.) ESA 2008. LNCS, vol. 5193, pp. 13–28. Springer, Heidelberg (2008)
van Gemund, A.J.: Symbolic performance modeling of parallel systems. IEEE Transactions on Parallel and Distributed Systems 54(7), 922–927 (2005)
Wickremesinghe, R., Arge, L., Chase, J.S., Vitter, J.S.: Efficient sorting using registers and caches. J. Experimental Algorithmics (JEA) 7, 9 (2002)
Woo, D.H., Lee, H.-H.S.: Extending Amdahl’s Law for energy-efficient computing in the many-core era. IEEE Computer 41(12), 24–31 (2008)

Author information

Authors and Affiliations

Georgia Institute of Technology, Atlanta, GA, 30332-0765, USA
Richard Vuduc & Kenneth Czechowski

Editor information

Editors and Affiliations

INPT (ENSEEIHT) - IRIT, University of Toulouse, 31062, Toulouse, France
Michel Daydé
Lawrence Berkeley National Laboratory, 94720-8139, Berkeley, CA, USA
Osni Marques
Information Technology Center, The University of Tokyo, 113-8658, Tokyo, Japan
Kengo Nakajima

About this paper

Cite this paper

Vuduc, R., Czechowski, K. (2013). Toward a Theory of Algorithm-Architecture Co-design. In: Daydé, M., Marques, O., Nakajima, K. (eds) High Performance Computing for Computational Science - VECPAR 2012. VECPAR 2012. Lecture Notes in Computer Science, vol 7851. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38718-0_2

DOI: https://doi.org/10.1007/978-3-642-38718-0_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38717-3
Online ISBN: 978-3-642-38718-0
eBook Packages: Computer Science, Computer Science (R0)
