Code Transformations for TLB Power Reduction

  • Reiley Jeyapaul
  • Aviral Shrivastava


The Translation Look-aside Buffer (TLB) is a key part of the hardware support for virtual memory management in high-performance embedded systems. Although small, the TLB is accessed frequently; it therefore consumes significant energy and is one of the major thermal hot-spots in the processor. Recently, several circuit and microarchitectural implementations of TLBs have been proposed to reduce TLB power. One simple yet effective design is the Use-Last TLB architecture proposed in IEEE J. Solid-State Circuits 40(5), 1190–1199 (2005), which reduces power consumption whenever the most recently accessed page is accessed again. In this work, we develop code transformation techniques to reduce page switches in data cache accesses, and we propose an efficient page-aware code placement technique that enhances the energy reduction achieved by the Use-Last TLB architecture for instruction cache accesses. Our comprehensive page switch reduction algorithm yields an average 39% reduction in data-TLB page switches, and our code placement heuristic yields an average 76% reduction in instruction-TLB page switches, with negligible impact on performance, on benchmarks from the MiBench, Multimedia, DSPStone and BDTI suites. The page switches eliminated by our techniques translate into equivalent power savings, above and beyond the reduction achieved by the Use-Last TLB architecture alone.
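
To make the notion of a page switch concrete, the following is a minimal sketch, not the algorithm of the paper. It counts how often consecutive accesses in an address trace fall on different pages, which is the event a Use-Last TLB cannot short-circuit. The 4 KB page size, the base addresses `A_BASE` and `B_BASE`, and the helper name `count_page_switches` are all assumptions made for illustration.

```python
from typing import Iterable

PAGE_SIZE = 4096  # assumed page size in bytes; the abstract does not specify one


def count_page_switches(addresses: Iterable[int], page_size: int = PAGE_SIZE) -> int:
    """Count accesses whose page differs from the page of the previous access."""
    switches = 0
    last_page = None
    for addr in addresses:
        page = addr // page_size
        if last_page is not None and page != last_page:
            switches += 1
        last_page = page
    return switches


# Hypothetical base addresses of two arrays that live on different pages.
A_BASE = 0x10000
B_BASE = 0x20000
N = 8  # number of 4-byte elements touched in each array

# Interleaved order: a[0], b[0], a[1], b[1], ...
interleaved = []
for i in range(N):
    interleaved.append(A_BASE + 4 * i)
    interleaved.append(B_BASE + 4 * i)

# Grouped order: all of a[], then all of b[].
grouped = [A_BASE + 4 * i for i in range(N)] + [B_BASE + 4 * i for i in range(N)]

print(count_page_switches(interleaved))  # 15
print(count_page_switches(grouped))      # 1
```

In this toy trace, grouping the accesses by page lowers the switch count from 15 to 1. A real transformation may reorder accesses only where data dependences allow it, which this sketch does not check.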


TLB power · Code transformation · Compiler technique · I-TLB power · D-TLB power · Instruction scheduling · Code placement




  1. Ekman, M., Stenström, P., Dahlgren, F.: TLB and snoop energy-reduction using virtual caches in low-power chip-multiprocessors. In: ISLPED ’02, pp. 243–246. ACM, New York (2002)
  2. Kadayif, I., Sivasubramaniam, A., Kandemir, M., Kandiraju, G., Chen, G.: Optimizing instruction TLB energy using software and hardware techniques. ACM Trans. Des. Autom. Electron. Syst. 10(2), 229–257 (2005)
  3. Zhou, X., Petrov, P.: Low-power cache organization through selective tag translation for embedded processors with virtual memory support. In: GLSVLSI ’06, pp. 398–403. ACM, New York (2006)
  4. Petrov, P., Tracy, D., Orailoglu, A.: Energy-efficient physically tagged caches for embedded processors with virtual memory. In: DAC ’05, pp. 17–22. ACM, New York (2005)
  5. Haigh, J.R., Wilkerson, M., Miller, J., Beatty, T., Strazdus, S., Clark, L.: A low-power 2.5 GHz 90 nm level 1 cache and memory management unit. IEEE J. Solid-State Circuits 40(5), 1190–1199 (2005)
  6. Clark, L.T., Choi, B., Wilkerson, M.: Reducing translation lookaside buffer active power. In: ISLPED ’03, pp. 10–13. ACM, New York (2003)
  7. Manne, S., Klauser, A., Grunwald, D., Somenzi, F.: Low power TLB design for high performance microprocessors [Online]. Available: (1997)
  8. Lee, J.-H., Park, G.-H., Park, S.-B., Kim, S.-D.: A selective filter-bank TLB system. In: ISLPED ’03, pp. 312–317. ACM, New York (2003)
  9. Choi, J.-H., Lee, J.-H., Jeong, S.-W., Kim, S.-D., Weems, C.: A low power TLB structure for embedded systems. IEEE Comput. Archit. Lett. 1(1), 3 (2006)
  10. Chang, Y.-J.: An ultra low-power TLB design. In: DATE ’06: Proceedings of the Conference on Design, Automation and Test in Europe, pp. 1122–1127. European Design and Automation Association, Leuven, Belgium (2006)
  11. Kadayif, I., Nath, P., Kandemir, M., Sivasubramaniam, A.: Compiler-directed physical address generation for reducing DTLB power. In: ISPASS ’04, pp. 161–168. IEEE Computer Society, Los Alamitos, CA (2004)
  12. Delaluz, V., Kandemir, M., Vijaykrishnan, N., Irwin, M., Sivasubramaniam, A., Kolcu, I.: Compiler-directed array interleaving for reducing energy in multi-bank memories. In: ASP-DAC ’02, pp. 288–293. IEEE Computer Society, Los Alamitos, CA (2002)
  13. Parikh, A., Kim, S., Kandemir, M., Vijaykrishnan, N., Irwin, M.: Instruction scheduling for low power. In: VLSI-SP ’04, pp. 129–149. Springer, Netherlands (2004)
  14. Chiyonobu, A., Sato, T.: Energy-efficient instruction scheduling utilizing cache miss information. In: MEDEA ’05: Proceedings of the 2005 Workshop on MEmory performance. IEEE Computer Society, Los Alamitos, CA (2005)
  15. Kandemir, M., Kadayif, I., Chen, G.: Compiler-directed code restructuring for reducing data TLB energy. In: CODES+ISSS ’04, pp. 98–103. IEEE Computer Society, Washington (2004)
  16. Intel Corporation: Intel XScale® Technology Overview [Online]. Available:
  17. Guthaus, M.R., Ringenberg, J.S., Ernst, D., Austin, T.M., Mudge, T., Brown, R.B.: MiBench: a free, commercially representative embedded benchmark suite. In: WWC ’01: Proceedings of the Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop on, pp. 3–14. IEEE Computer Society, Washington, DC (2001)
  18. Balakrishnan, H., Garg, R.: Multimedia benchmarks: a performance comparison of multimedia programs on different architectures [Online]. Available:
  19. Zivojnovic, V., Velarde, J., Schlager, C., Meyr, H.: DSPStone: a DSP-oriented benchmarking methodology. In: Proceedings of Signal Processing Applications and Technology, Dallas (1994)
  20. Henning, J.L.: SPEC CPU2000: measuring CPU performance in the new millennium. Computer 33(7), 28–35 (2000)
  21. BDTI Suite: Berkeley Design Technology Inc, The BDTI benchmark suites [Online]. Available:
  22. Austin, T.: SimpleScalar LLC
  23. Shrivastava, A., Earlie, E., Dutt, N., Nicolau, A.: Operation tables for scheduling in the presence of incomplete bypassing. In: CODES+ISSS, pp. 194–199 (2004)
  24. Issenin, I., Dutt, N.: Foray-gen: automatic generation of affine functions for memory optimizations. In: DATE ’05: Proceedings of the conference on Design, Automation and Test in Europe, pp. 808–813. IEEE Computer Society, Washington, DC (2005)

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Compiler and Microarchitecture Laboratory, Arizona State University, Tempe, USA
