International Journal of Parallel Programming, Volume 44, Issue 2, pp. 337–380

Memory Partitioning in the Limit

  • Emre Kültürsay
  • Kemal Ebcioğlu
  • Gürhan Küçük
  • Mahmut T. Kandemir


Abstract

The key difficulties in designing memory hierarchies for future computing systems with extreme-scale parallelism include (1) overcoming the design complexity of system-wide memory coherence, (2) achieving low power, and (3) achieving fast access times within such a memory hierarchy. To address these difficulties, in this paper we propose an automatic memory partitioning method that generates a customized, application-specific, energy-efficient, low-latency memory hierarchy tailored to a particular application program. Given a software program to accelerate, our method automatically partitions the memory of the original program, creates a new application-specific multi-level memory hierarchy for the program, and modifies the original program to use the new hierarchy. The new memory hierarchy and the modified program then serve as the basis for a customized, application-specific, highly parallel hardware accelerator that is functionally equivalent to the original, unmodified program. Using dependence analysis and fine-grain valid/dirty bits, the memories in the generated hierarchy can operate in parallel without the need to maintain coherence, and can be independently initialized from, and flushed to, their parent memories in the hierarchy, enabling a scalable memory design. The generated memories are fully compatible with the memory addressing of the original software program; this compatibility enables the translation of general software applications into application-specific accelerators. We also provide a compiler analysis method, based on symbolic execution, for performing accurate dependence analysis for memory partitioning, and a profiler-based futuristic limit study to identify the maximum gains achievable by memory partitioning.
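The valid/dirty-bit mechanism summarized above can be made concrete with a short model of one child memory in the hierarchy. The following is a minimal sketch, not the paper's implementation: the class MemoryPartition and its load/store/flush interface are hypothetical names, and a real design would track the valid/dirty state in hardware rather than in std::vector<bool>. It shows how per-word valid bits let a partition fill lazily from its parent memory, and how dirty bits let it write back only modified words, so partitions covering disjoint, dependence-proven address ranges never need coherence traffic with one another.

    // Hypothetical sketch of one child memory partition with per-word
    // valid/dirty bits; illustrative only, not the authors' implementation.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    class MemoryPartition {
    public:
        // 'parent' is the next level up in the hierarchy; this partition
        // covers the parent address range [base, base + size).
        MemoryPartition(std::vector<uint32_t>& parent, std::size_t base,
                        std::size_t size)
            : parent_(parent), base_(base),
              data_(size, 0), valid_(size, false), dirty_(size, false) {}

        // Read a word, fetching it from the parent on first touch.
        // Addresses are the original program's addresses, keeping the
        // partition compatible with the unmodified code's addressing.
        uint32_t load(std::size_t addr) {
            std::size_t i = addr - base_;
            if (!valid_[i]) {
                data_[i] = parent_[addr];
                valid_[i] = true;
            }
            return data_[i];
        }

        // Write a word locally and mark it dirty; the parent is not
        // updated, so no coherence action occurs while the partition is live.
        void store(std::size_t addr, uint32_t value) {
            std::size_t i = addr - base_;
            data_[i] = value;
            valid_[i] = true;
            dirty_[i] = true;
        }

        // Independently flush this partition: write back only the dirty
        // words to the parent, then invalidate the local copies.
        void flush() {
            for (std::size_t i = 0; i < data_.size(); ++i) {
                if (dirty_[i]) parent_[base_ + i] = data_[i];
                valid_[i] = false;
                dirty_[i] = false;
            }
        }

    private:
        std::vector<uint32_t>& parent_;  // parent memory in the hierarchy
        const std::size_t base_;         // first parent address covered
        std::vector<uint32_t> data_;     // local storage
        std::vector<bool> valid_;        // per word: holds a current copy?
        std::vector<bool> dirty_;        // per word: modified since fill?
    };

Because the bits are kept per word, the initialization and flush cost scales with the words a partition actually touches, which is what allows each memory to be initialized and flushed independently of its siblings, without any global coherence protocol.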


Keywords

Memory partitioning · Parallel processing · Application-specific hardware accelerators · Exascale computing · Supercomputers



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Emre Kültürsay (1)
  • Kemal Ebcioğlu (2)
  • Gürhan Küçük (3)
  • Mahmut T. Kandemir (1)

  1. Pennsylvania State University, University Park, USA
  2. Global Supercomputing Corporation, Yorktown Heights, USA
  3. Faculty of Engineering and Architecture, Yeditepe University, Istanbul, Turkey
