Extending Modulo Scheduling with Memory Reference Merging

  • Benoît Dupont de Dinechin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1575)


We describe an extension of modulo scheduling, called “memory reference merging”, which improves the management of cache bandwidth on microprocessors such as the DEC Alpha 21164. The principle is to schedule together memory references that are likely to be merged in a read buffer (for LOADs) or a write buffer (for STOREs). This technique has been used for several years in the Cray T3E block scheduler, and was later generalized to the Cray T3E software pipeliner. Experiments on the Cray T3E demonstrate the benefits of memory reference merging.
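As a rough illustration of the principle stated in the abstract, the sketch below groups memory references that fall in the same cache block, so that a scheduler could try to place each group in nearby cycles. It is not the algorithm from the paper, which the abstract does not detail. The function name `merge_candidates`, the tuple layout, the 32-byte block size, and the assumption that each base address is block-aligned are all hypothetical choices made for this example.

```python
from collections import defaultdict

# Assumed cache block size in bytes; an illustrative value only.
BLOCK_BYTES = 32


def merge_candidates(refs, block_bytes=BLOCK_BYTES):
    """Group references that share kind, base, and cache block.

    refs: iterable of (name, kind, base, offset) tuples, where kind is
    "load" or "store" and offset is a byte offset from a base address
    that is assumed (hypothetically) to be block-aligned.
    """
    groups = defaultdict(list)
    for name, kind, base, offset in refs:
        # Loads and stores are kept apart: the abstract associates them
        # with different buffers (read buffer vs. write buffer).
        groups[(kind, base, offset // block_bytes)].append(name)
    # Only groups with two or more references are candidates for merging.
    return [names for names in groups.values() if len(names) > 1]


refs = [
    ("a0", "load", "a", 0),
    ("a1", "load", "a", 8),
    ("a2", "load", "a", 32),   # next block: not grouped with a0, a1
    ("b0", "store", "b", 0),
    ("b1", "store", "b", 16),
]
candidates = merge_candidates(refs)
```

Here `candidates` holds two groups, one of loads and one of stores. A real scheduler would also have to respect dependences and resource constraints when placing each group, which this sketch leaves out.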


Keywords: Memory Reference, Memory Hierarchy, Loop Body, Software Pipeline, Cache Block



Copyright information

© Springer-Verlag Berlin Heidelberg 1999

Authors and Affiliations

  • Benoît Dupont de Dinechin (1)
  1. CMG/MDT Division, ST Microelectronics
