The Journal of Supercomputing, Volume 74, Issue 4, pp 1510–1521

Almost optimal column-wise prefix-sum computation on the GPU

  • Hiroki Tokura
  • Toru Fujita
  • Koji Nakano
  • Yasuaki Ito
  • Jacir L. Bordim


Abstract

Row-wise and column-wise prefix-sum computation of a matrix has many applications in the area of image processing, such as computation of the summed area table and the Euclidean distance map. It is known that the prefix-sums of a one-dimensional array can be computed efficiently on the GPU. Hence, row-wise prefix-sums of a matrix can also be computed efficiently on the GPU by executing this prefix-sum algorithm for every row in parallel. However, the same approach does not work well for computing column-wise prefix-sums, because it performs inefficient strided access to the global memory. The main contribution of this paper is to present an almost optimal column-wise prefix-sum algorithm on the GPU. Quite surprisingly, experimental results using an NVIDIA TITAN X show that our column-wise prefix-sum algorithm runs only 2–6% slower than simple matrix duplication. Thus, our column-wise prefix-sum algorithm is almost optimal.


Keywords: Prefix computation · Parallel algorithm · GPU · CUDA



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Information Engineering, Hiroshima University, Higashihiroshima, Japan
  2. Department of Computer Science, University of Brasilia, Brasília, Brazil
