Spill Code Placement for SIMD Machines

  • Diogo Nunes Sampaio
  • Elie Gedeon
  • Fernando Magno Quintão Pereira
  • Sylvain Collange
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7554)


The Single Instruction, Multiple Data (SIMD) execution model has received renewed attention in recent years, driven by the rise of graphics processing units (GPUs) as a powerful alternative for parallel computing. Many compiler optimizations have recently been proposed for this hardware, but register allocation remains largely unexplored. In this context, this paper describes a register spiller for SIMD machines that capitalizes on the opportunity to share identical data between threads. It provides two benefits: first, it uses less memory, because more spilled values are shared among threads; second, it improves the access times to spilled values. We have implemented the proposed allocator in the Ocelot open source compiler and have been able to speed up the code produced by this framework by 21%. Although we designed our algorithm on top of a linear scan register allocator, we claim that our ideas can be easily adapted to fit the needs of other register allocators.
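The memory-saving argument can be illustrated with a small back-of-the-envelope model. The sketch below is not the paper's algorithm; it only counts spill slots under the assumption that a value known to be identical in every thread can be kept in a single shared slot, while any other value needs one private slot per thread. The names `SpilledValue`, `uniform`, and `spill_slots`, as well as the example variables and the thread count of 32, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpilledValue:
    name: str
    uniform: bool  # assumption: True if every thread holds the same value


def spill_slots(values, num_threads, share_uniform):
    """Count the memory slots needed to hold the spilled values."""
    slots = 0
    for v in values:
        if share_uniform and v.uniform:
            slots += 1            # one copy serves all threads
        else:
            slots += num_threads  # each thread keeps a private copy
    return slots


# Hypothetical spill set: two values identical across threads, one per-thread.
spills = [
    SpilledValue("base_ptr", uniform=True),
    SpilledValue("loop_bound", uniform=True),
    SpilledValue("thread_offset", uniform=False),
]

private_only = spill_slots(spills, num_threads=32, share_uniform=False)
with_sharing = spill_slots(spills, num_threads=32, share_uniform=True)
print(private_only, with_sharing)
```

In this toy model the saving grows with the number of threads and with the fraction of spilled values that are identical across threads; it says nothing about access times, which depend on the memory hierarchy of the target machine.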


Shared Memory, Local Memory, Global Memory, Register Allocation, Compiler Optimization
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Diogo Nunes Sampaio (1)
  • Elie Gedeon (1)
  • Fernando Magno Quintão Pereira (1)
  • Sylvain Collange (1)

  1. Departamento de Ciência da Computação, UFMG, Brazil
