Abstract
This paper reports the adaptation of the Multidimensional Multiscale Parser (MMP) algorithm to CUDA. Specifically, we focus on memory optimization issues, such as the layout of data structures in memory, the choice of GPU memory type – shared, constant and global – and the achievement of coalesced accesses. MMP is a computationally demanding lossy image compression algorithm: for example, encoding the 512×512 Lenna image requires nearly 9000 seconds on a 2013 Intel Xeon. One of the main challenges in adapting MMP to manycore architectures is its dependency on a pattern codebook that is built during execution, which forces the input image to be processed sequentially. Nonetheless, CUDA-MMP achieves a 12× speedup over the sequential version when run on an NVIDIA GTX 680. By further optimizing memory operations, the speedup rises to 17.1×.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Domingues, P., Silva, J., Ribeiro, T., Rodrigues, N.M.M., De Carvalho, M.B., De Faria, S.M.M. (2014). Optimizing Memory Usage and Accesses on CUDA-Based Recurrent Pattern Matching Image Compression. In: Murgante, B., et al. Computational Science and Its Applications – ICCSA 2014. ICCSA 2014. Lecture Notes in Computer Science, vol 8582. Springer, Cham. https://doi.org/10.1007/978-3-319-09147-1_41
DOI: https://doi.org/10.1007/978-3-319-09147-1_41
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09146-4
Online ISBN: 978-3-319-09147-1
eBook Packages: Computer Science (R0)