International Journal of Parallel Programming, Volume 40, Issue 4, pp 381–396

Software Controlled Adaptive Pre-Execution for Data Prefetching



Data prefetching mechanisms are widely used to hide memory latency in data-intensive applications. They mask the speed gap between CPUs and their memory systems by preloading data into the CPU caches, where access is at least an order of magnitude faster. Pre-execution is a combined prefetching method that executes a slice of the original code, preloading the code and its data at the same time. Pre-execution is often mentioned in the literature but, to the best of our knowledge, it has not been formally defined yet. We fill this void by presenting a formal definition of speculative and non-speculative pre-execution, and we derive a lightweight software-based strategy that accelerates the main working thread with an adaptive, non-speculative pre-execution helper thread. This helper thread acts as a perfect predictor: it calculates memory addresses, prefetches the data, and consumes cache misses early. Adaptive automatic control allows the helper thread to configure itself at run-time for best performance. The method is directly applicable to any data-intensive application without requiring hardware modifications, and it achieved an average speedup of 10–30% in a real-life application.


Keywords: Pre-execution · Data prefetch · Self-configuration · Data intensive application · Performance
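To illustrate the basic mechanism summarized in the abstract, the following minimal C sketch shows a non-speculative helper thread that walks the same irregular index stream a fixed distance ahead of the main working thread and issues software prefetches for it. This is an illustrative sketch under assumptions, not the implementation from the paper: the names (helper_main, PREFETCH_DISTANCE, progress) are hypothetical, the run-ahead distance is fixed rather than adaptively tuned at run-time, and the synchronization is deliberately simple.

/* Minimal illustration of a non-speculative pre-execution helper thread.
 * Sketch under assumptions, not the authors' implementation: the main
 * thread gathers values through an index array, while a helper thread
 * walks the same indices PREFETCH_DISTANCE elements ahead and issues
 * software prefetches, so the main thread tends to hit in cache.
 * Compile with: gcc -O2 -pthread (uses the GCC/Clang __builtin_prefetch). */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 22)
#define PREFETCH_DISTANCE 64        /* hypothetical fixed run-ahead distance */

static double table[N];             /* large table accessed irregularly      */
static size_t idx[N];               /* randomized access pattern             */
static _Atomic size_t progress;     /* position of the main working thread   */

static void *helper_main(void *arg)
{
    (void)arg;
    for (size_t i = 0; i < N; ++i) {
        /* Stay at most PREFETCH_DISTANCE elements ahead of the worker. */
        while (i > atomic_load_explicit(&progress, memory_order_relaxed)
                       + PREFETCH_DISTANCE)
            ;                       /* busy-wait; a real helper would throttle */
        __builtin_prefetch(&table[idx[i]], 0 /* read */, 1 /* low locality */);
    }
    return NULL;
}

int main(void)
{
    for (size_t i = 0; i < N; ++i) {
        table[i] = (double)i;
        idx[i]   = (size_t)rand() % N;   /* irregular, cache-unfriendly order */
    }

    pthread_t helper;
    pthread_create(&helper, NULL, helper_main, NULL);

    double sum = 0.0;
    for (size_t i = 0; i < N; ++i) {     /* main working thread               */
        sum += table[idx[i]];
        atomic_store_explicit(&progress, i, memory_order_relaxed);
    }

    pthread_join(helper, NULL);
    printf("sum = %f\n", sum);
    return 0;
}

In the paper's approach the equivalent of PREFETCH_DISTANCE and related parameters are adjusted automatically at run-time; here the value is hard-coded only to keep the example short.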





Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Ákos Dudás (1)
  • Sándor Juhász (1)
  • Tamás Schrádi (1)

  1. Department of Automation and Applied Informatics, Budapest University of Technology and Economics, Budapest, Hungary
