A Flexible Multi-port Caching Scheme for Reconfigurable Platforms

  • Su-Shin Ang
  • George Constantinides
  • Peter Cheung
  • Wayne Luk
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3985)


Memory accesses contribute substantially to aggregate system delays. It is critical for designers to ensure that the memory subsystem is designed efficiently, and much work has been done on the exploitation of data re-use for algorithms that exhibit static memory access patterns in FPGAs. The proposed scheme enables the exploitation of data re-use for both static and non-static parallel memory access patterns through the use of a multi-port cache, where parameters can be determined at compile time and matched to the statistical properties of the application, and where sub-cache contentions are arbitrated with a semaphore-based system. A complete hardware implementation demonstrates that, for a motion vector estimation benchmark, the proposed caching scheme results in a cycle count reduction of 51% and execution time reduction of up to 24%, using a Xilinx XC2V6000 FPGA on a Celoxica RC300 board. Hardware resource usage and clock frequency penalties are analyzed while varying the number of ports and cache size. Consequently, it is demonstrated how the optimum cache size and number of ports may be established for a given datapath.
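The gap between the 51% cycle count reduction and the 24% execution time reduction is consistent with the clock frequency penalty the abstract mentions. The following is a minimal sketch of that arithmetic, assuming execution time equals cycle count divided by clock frequency and assuming both figures refer to the same configuration (the abstract says "up to 24%", so this may not hold). The function name `implied_clock_ratio` is hypothetical and the result is illustrative, not a figure reported by the authors.

```python
def implied_clock_ratio(cycle_reduction: float, time_reduction: float) -> float:
    """Return f_cached / f_baseline implied by the two reductions.

    Assumption: execution time = cycle count / clock frequency for both
    the baseline design and the cached design.
    """
    cycles_ratio = 1.0 - cycle_reduction  # cached cycles / baseline cycles
    time_ratio = 1.0 - time_reduction     # cached time / baseline time
    # time_ratio = cycles_ratio / clock_ratio, so solve for clock_ratio.
    return cycles_ratio / time_ratio


# Figures quoted in the abstract: 51% fewer cycles, up to 24% less time.
ratio = implied_clock_ratio(0.51, 0.24)
print(f"implied clock ratio: {ratio:.3f}")
```

Under these assumptions the cached design would run at roughly two thirds of the baseline clock frequency, which illustrates why the choice of cache size and number of ports has to weigh cycle savings against the achievable clock rate.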


Execution Time, Memory Access, Resource Usage, External Memory, Cache Size
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Su-Shin Ang
    • 1
  • George Constantinides
    • 1
  • Peter Cheung
    • 1
  • Wayne Luk
    • 2
  1. Dept. of Electrical and Electronics Engineering, Imperial College, London, UK
  2. Dept. of Computing, Imperial College, London, UK
