Abstract
We are carrying out a research program that asks whether there is a useful mathematical framework for reasoning at a high level about the behavior of an algorithm on a supercomputer with respect to the physical constraints of energy, power, and die area. By “high-level,” we mean that we wish to explicitly relate characteristics of an algorithm, such as its inherent parallelism or memory and communication behavior, to parameters of an architecture, such as the number of cores, structure of the memory hierarchy, or network topology. Our ultimate goal is to say, in broad but also quantitative terms, how macroscopic changes to an architecture might affect the execution time, scalability, accuracy, and power efficiency of a computation; and, conversely, to identify what classes of computation might best match a given architecture. The approach we shall outline marries abstract algorithmic complexity analysis with caps on power and die area, which are arguably the central first-order constraints on the extreme-scale systems of 2018 and beyond [1, 16, 21, 29, 41]. We refer to our approach as one of algorithm-architecture co-design.
References
1. The Potential Impact of High-End Capability Computing on Four Illustrative Fields of Science and Engineering. The National Academies Press, Washington, DC (2008)
2. Arge, L., Goodrich, M.T., Nelson, M., Sitchinava, N.: Fundamental parallel algorithms for private-cache chip multiprocessors. In: Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures, SPAA 2008, p. 197. ACM Press, New York (2008)
3. Badia, R.M., Rodriguez, G., Labarta, J.: Deriving analytical models from a limited number of runs. In: Proceedings of Parallel Computing, ParCo, Minisymposium on Performance Analysis, pp. 1–6 (2003)
4. Barker, K., Benner, A., Hoare, R., Hoisie, A., Jones, A., Kerbyson, D., Li, D., Melhem, R., Rajamony, R., Schenfeld, E., Shao, S., Stunkel, C., Walker, P.: On the Feasibility of Optical Circuit Switching for High Performance Computing Systems. In: ACM/IEEE SC 2005 Conference, SC 2005. IEEE (2005), http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1559968&tag=1
5. Barker, K.J., Hoisie, A., Kerbyson, D.J.: An early performance analysis of POWER7-IH HPC systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on SC 2011, p. 1. ACM Press, New York (2011)
6. Blelloch, G.E.: Programming parallel algorithms. Communications of the ACM 39(3), 85–97 (1996)
7. Blelloch, G.E., Gibbons, P.B., Simhadri, H.V.: Low depth cache-oblivious algorithms. In: Proc. ACM Symp. Parallel Algorithms and Architectures, SPAA, Santorini, Greece (June 2010)
8. Carrington, L., Snavely, A., Wolter, N.: A performance prediction framework for scientific applications. Future Generation Computer Systems 22(3), 336–346 (2006)
9. Casas, M., Badia, R.M., Labarta, J.: Prediction of behavior of MPI applications. In: 2008 IEEE International Conference on Cluster Computing, pp. 242–251. IEEE (September 2008)
10. Chandramowlishwaran, A., Choi, J.W., Madduri, K., Vuduc, R.: Towards a communication optimal fast multipole method and its implications for exascale. In: Proc. ACM Symp. Parallel Algorithms and Architectures, pp. 182–184. ACM, New York (2012), http://dl.acm.org/citation.cfm?id=2312039
11. Chowdhury, R.A., Silvestri, F., Blakeley, B., Ramachandran, V.: Oblivious algorithms for multicores and network of processors. In: 2010 IEEE International Symposium on Parallel & Distributed Processing, IPDPS, pp. 1–12. IEEE (2010)
12. Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K.E., Santos, E., Subramonian, R., von Eicken, T.: LogP: Towards a realistic model of parallel computation. ACM SIGPLAN Notices 28(7), 1–12 (1993)
13. Czechowski, K., McClanahan, C., Battaglino, C., Iyer, K., Yeung, P.-K., Vuduc, R.: On the communication complexity of 3D FFTs and its implications for exascale. In: Proc. ACM Int’l. Conf. Supercomputing, ICS, San Servolo Island, Venice, Italy (June 2012) (to appear)
14. Demmel, J.W.: Applied Numerical Linear Algebra. SIAM (1997)
15. Desprez, F., Markomanolis, G.S., Quinson, M., Suter, F.: Assessing the Performance of MPI Applications through Time-Independent Trace Replay. In: 2011 40th International Conference on Parallel Processing Workshops, pp. 467–476. IEEE (September 2011)
16. Dongarra, J., Beckman, P., Aerts, P., Cappello, F., Lippert, T., Matsuoka, S., Messina, P., Moore, T., Stevens, R., Trefethen, A., Valero, M.: The International Exascale Software Project: A call to cooperative action by the global high performance community. In: Int’l. J. High-Performance Computing Applications, IJHPCA, vol. 23(4), pp. 309–322 (2009), http://hpc.sagepub.com/content/23/4/309
17. Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. In: Proc. Symp. Foundations of Computer Science, FOCS, New York, NY, USA, pp. 285–297 (October 1999)
18. Ghosh, S., Martonosi, M., Malik, S.: Cache miss equations: A compiler framework for analyzing and tuning memory behavior. ACM Trans. Programming Languages and Systems (TOPLAS) 21(4), 703–746 (1999)
19. Gonzalez, J., Gimenez, J., Casas, M., Moreto, M., Ramirez, A., Labarta, J., Valero, M.: Simulating Whole Supercomputer Applications. IEEE Micro 31(3), 32–45 (2011)
20. Guz, Z., Bolotin, E., Keidar, I., Kolodny, A., Mendelson, A., Weiser, U.: Many-Core vs. Many-Thread Machines: Stay Away From the Valley. IEEE Computer Architecture Letters 8(1), 25–28 (2009)
21. Hemmert, K.S., Vetter, J.S., Bergman, K., Das, C., Emami, A., Janssen, C., Panda, D.K., Stunkel, C., Underwood, K., Yalamanchili, S.: IAA Interconnection Networks Workshop 2008. Technical Report FTGTR-2009-03, Future Technologies Group, Oak Ridge National Laboratory (April 2009), http://ft.ornl.gov/pubs-archive/iaa-ic-2008-workshop-report-final.pdf
22. Hill, M.D., Marty, M.R.: Amdahl’s Law in the multicore era. IEEE Computer 41(7), 33–38 (2008)
23. Hoefler, T., Schneider, T., Lumsdaine, A.: LogGOPSim: Simulating large-scale applications in the LogGOPS model. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC 2010, p. 597. ACM Press, New York (2010)
24. Hoisie, A., Johnson, G., Kerbyson, D.J., Lang, M., Pakin, S.: A performance comparison through benchmarking and modeling of three leading supercomputers: Blue Gene/L, Red Storm, and Purple. In: Proc. ACM/IEEE Conf. Supercomputing, SC, number 74, Tampa, FL, USA (November 2006)
25. Jagode, H., Knupfer, A., Dongarra, J., Jurenz, M., Mueller, M.S., Nagel, W.E.: Trace-based performance analysis for the petascale simulation code FLASH. International Journal of High Performance Computing Applications (December 2010)
26. Kerbyson, D.J., Alme, H.J., Hoisie, A., Petrini, F., Wasserman, H.J., Gittings, M.: Predictive performance and scalability modeling of a large-scale application. In: Proceedings of the 2001 ACM/IEEE Conference on Supercomputing (CDROM) - Supercomputing 2001, p. 37. ACM Press, New York (2001)
27. Kerbyson, D.J., Hoisie, A., Wasserman, H.: Modelling the performance of large-scale systems. In: IEE Proceedings–Software, vol. 150, pp. 214–221 (August 2003)
28. Kerbyson, D.J., Jones, P.W.: A Performance Model of the Parallel Ocean Program. International Journal of High Performance Computing Applications 19(3), 261–276 (2005)
29. Kogge, P., Bergman, K., Borkar, S., Campbell, D., Carlson, W., Dally, W., Denneau, M., Franzon, P., Harrod, W., Hill, K., Hiller, J., Karp, S., Keckler, S., Klein, D., Lucas, R., Richards, M., Scarpelli, A., Scott, S., Snavely, A., Sterling, T., Williams, R.S., Yelick, K.: Exascale Computing Study: Technology challenges in achieving exascale systems (September 2008), http://users.ece.gatech.edu/~mrichard/ExascaleComputingStudyReports/ECS_reports.htm
30. Kung, H.: Let’s design algorithms for VLSI systems. In: Proceedings of the Caltech Conference on VLSI: Architecture, Design, and Fabrication, pp. 65–90 (1979)
31. Lengauer, T.: VLSI theory. In: Handbook of Theoretical Computer Science, ch. 16, pp. 837–865. Elsevier Science Publishers B.V. (1990)
32. Lively, C.W., Taylor, V.E., Alam, S.R., Vetter, J.S.: A methodology for developing high fidelity communication models for large-scale applications targeted on multicore systems. In: Proc. Int’l. Symp. Computer Architecture and High Performance Computing, SBAC-PAD, Mato Grosso do Sul, Brazil, pp. 55–62 (October 2008)
33. Mandel, J., Parter, S.V.: On the multigrid F-cycle. Applied Mathematics and Computation 37(1), 19–36 (1990)
34. Numrich, R.W.: Computational force: A unifying concept for scalability analysis. In: Advances in Parallel Computing, vol. 15. IOS Press (2008)
35. Numrich, R.W.: A metric space for computer programs and the Principle of Computational Least Action. J. Supercomputing 43(3), 281–298 (2008)
36. Numrich, R.W., Heroux, M.A.: Self-similarity of parallel machines. Parallel Computing 37(2), 69–84 (2011)
37. Rodrigues, A.F., et al.: The structural simulation toolkit. ACM SIGMETRICS Performance Evaluation Review 38(4), 37 (2011)
38. Rosenberg, A.L.: Three-Dimensional VLSI: a case study. Journal of the ACM 30(3), 397–416 (1983)
39. Rosenfeld, P., Cooper-Balis, E., Jacob, B.: DRAMSim2: A Cycle Accurate Memory System Simulator. IEEE Computer Architecture Letters 10(1), 16–19 (2011)
40. Savage, J.E.: Models of Computation: Exploring the power of computing. CC-3.0, BY-NC-ND, electronic edition (2008)
41. Simon, H., Zacharia, T., Stevens, R.: Modeling and simulation at the exascale for energy and the environment. Technical report, Office of Science, U.S. Dept. of Energy (May 2008), http://www.sc.doe.gov/ascr/ProgramDocuments/Docs/TownHall.pdf
42. Snavely, A., Wolter, N., Carrington, L.: Modeling application performance by convolving machine signatures with application profiles. In: Proceedings of the Fourth Annual IEEE International Workshop on Workload Characterization, WWC-4 (Cat. No.01EX538), pp. 149–156. IEEE
43. Thompson, C.D.: Area-time complexity for VLSI. In: Proceedings of the Eleventh Annual ACM Symposium on Theory of Computing, STOC 1979, pp. 81–88. ACM Press, New York (1979)
44. Toledo, S.: Locality of reference in LU decomposition with partial pivoting. SIAM J. Matrix Anal. Appl. 18(4), 1065–1081 (1997)
45. Valiant, L.G.: A bridging model for parallel computation. Communications of the ACM 33(8), 103–111 (1990)
46. Valiant, L.G.: A bridging model for multi-core computing. In: Halperin, D., Mehlhorn, K. (eds.) ESA 2008. LNCS, vol. 5193, pp. 13–28. Springer, Heidelberg (2008)
47. van Gemund, A.J.: Symbolic performance modeling of parallel systems. IEEE Transactions on Parallel and Distributed Systems 54(7), 922–927 (2005)
48. Wickremesinghe, R., Arge, L., Chase, J.S., Vitter, J.S.: Efficient sorting using registers and caches. J. Experimental Algorithmics (JEA) 7, 9 (2002)
49. Woo, D.H., Lee, H.-H.S.: Extending Amdahl’s Law for energy-efficient computing in the many-core era. IEEE Computer 41(12), 24–31 (2008)
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Vuduc, R., Czechowski, K. (2013). Toward a Theory of Algorithm-Architecture Co-design. In: Daydé, M., Marques, O., Nakajima, K. (eds) High Performance Computing for Computational Science - VECPAR 2012. VECPAR 2012. Lecture Notes in Computer Science, vol 7851. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38718-0_2
DOI: https://doi.org/10.1007/978-3-642-38718-0_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38717-3
Online ISBN: 978-3-642-38718-0
eBook Packages: Computer Science, Computer Science (R0)