Almost optimal column-wise prefix-sum computation on the GPU

Tokura, Hiroki; Fujita, Toru; Nakano, Koji; Ito, Yasuaki; Bordim, Jacir L.

doi:10.1007/s11227-018-2242-8

Published: 16 January 2018

Volume 74, pages 1510–1521, (2018)

The Journal of Supercomputing

Hiroki Tokura¹,
Toru Fujita¹,
Koji Nakano ORCID: orcid.org/0000-0002-2040-4032¹,
Yasuaki Ito¹ &
Jacir L. Bordim²

Abstract

Row-wise and column-wise prefix-sum computations on a matrix have many applications in image processing, such as computing the summed area table and the Euclidean distance map. It is known that the prefix-sums of a one-dimensional array can be computed efficiently on the GPU. Hence, row-wise prefix-sums of a matrix can also be computed efficiently on the GPU by executing this prefix-sum algorithm for every row in parallel. However, the same approach does not work well for computing column-wise prefix-sums, because it performs inefficient stride memory access to the global memory. The main contribution of this paper is to present an almost optimal column-wise prefix-sum algorithm on the GPU. Quite surprisingly, experimental results using an NVIDIA TITAN X show that our column-wise prefix-sum algorithm runs only 2–6% slower than matrix duplication. Thus, our column-wise prefix-sum algorithm is almost optimal.
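
To make the terms in the abstract concrete, the following is a minimal sequential sketch of row-wise and column-wise prefix-sums on a matrix stored in row-major order, and of the summed area table obtained by applying one after the other. It is a reference definition only, not the GPU algorithm of the paper; the function names and the flat-list layout are assumptions chosen for illustration.

```python
def row_prefix_sums(a, rows, cols):
    """Row-wise prefix-sums of a row-major flat matrix (illustrative helper)."""
    out = list(a)
    for i in range(rows):
        for j in range(1, cols):
            # Neighbouring elements of a row are adjacent in memory.
            out[i * cols + j] += out[i * cols + j - 1]
    return out


def column_prefix_sums(a, rows, cols):
    """Column-wise prefix-sums of a row-major flat matrix (illustrative helper)."""
    out = list(a)
    for i in range(1, rows):
        for j in range(cols):
            # Neighbouring elements of a column are `cols` positions apart.
            out[i * cols + j] += out[(i - 1) * cols + j]
    return out


def summed_area_table(a, rows, cols):
    """Summed area table: row-wise prefix-sums followed by column-wise ones."""
    return column_prefix_sums(row_prefix_sums(a, rows, cols), rows, cols)


matrix = [1, 2, 3,
          4, 5, 6]  # a 2 x 3 matrix in row-major order
print(column_prefix_sums(matrix, 2, 3))  # [1, 2, 3, 5, 7, 9]
print(summed_area_table(matrix, 2, 3))   # [1, 3, 6, 5, 12, 21]
```

In this layout, consecutive elements of one column lie `cols` positions apart, so scanning each column independently, the way one would scan each row, produces the stride access pattern that the abstract identifies as inefficient on GPU global memory.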

Author information

Authors and Affiliations

Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashihiroshima, 739-8527, Japan
Hiroki Tokura, Toru Fujita, Koji Nakano & Yasuaki Ito
Department of Computer Science, University of Brasilia, Brasília, DF, 70910-900, Brazil
Jacir L. Bordim

Corresponding author

Correspondence to Koji Nakano.

About this article

Cite this article

Tokura, H., Fujita, T., Nakano, K. et al. Almost optimal column-wise prefix-sum computation on the GPU. J Supercomput 74, 1510–1521 (2018). https://doi.org/10.1007/s11227-018-2242-8

Issue Date: April 2018
DOI: https://doi.org/10.1007/s11227-018-2242-8
