A GPU-Based Parallel Reduction Implementation

  • Walid Abdala Rfaei Jradi
  • Hugo Alexandre Dantas do Nascimento
  • Wellington Santos Martins
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1171)


Reduction operations aggregate a finite set of numeric elements into a single value. They are extensively employed in many computational tasks and can be performed in parallel when multiple processing units are available. This work presents a GPU-based approach for parallel reduction that employs techniques such as loop unrolling, persistent threads, and algebraic expressions. It avoids thread divergence and outperforms the methods currently in use. Experiments conducted to evaluate the approach show that the strategy performs efficiently on both AMD and NVIDIA hardware platforms, with both OpenCL and CUDA implementations, making it portable.


Keywords: GPU · Parallel reduction · Portable



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Instituto de Informática, Universidade Federal de Goiás, Goiânia, Brazil
