Multimedia Tools and Applications, Volume 68, Issue 2, pp 237–251

Optimizing image processing on multi-core CPUs with Intel parallel programming technologies

  • Cheong Ghil Kim
  • Jeom Goo Kim
  • Do Hyeon Lee


The rapid advance of computer hardware and the popularity of multimedia applications have made multi-core processors with sub-word parallelism instructions a dominant market trend in desktop PCs as well as high-end mobile devices. This paper presents an efficient parallel implementation of the 2D convolution algorithm, which demands high-performance computing power, on multi-core desktop PCs. 2D convolution is a representative computation-intensive algorithm in image and signal processing applications; it is accompanied by heavy memory access, although its computational complexity is relatively low. The purpose of this study is to explore the effectiveness of exploiting the Streaming SIMD (Single Instruction Multiple Data) Extensions (SSE) technology and the TBB (Threading Building Blocks) run-time library on Intel multi-core processors. Using both lets us exploit the hardware features of a multi-core processor concurrently for data-level and task-level parallelism. For the performance evaluation, we implemented a 3 × 3 kernel-based convolution algorithm using SSE2 and TBB in different combinations and compared their processing speeds. The experimental results show that both technologies have a significant effect on performance and that processing speed improves greatly when the two are used together; for example, speedups of 6.2, 6.1, and 1.4 times over an implementation using only one of them are reported for 256 × 256, 512 × 512, and 1024 × 1024 data sets, respectively.
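The combination described in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. It computes the horizontal Sobel response (one 3 × 3 kernel) with SSE2 integer intrinsics for data-level parallelism and splits the image into row bands for task-level parallelism. Several things here are assumptions made for illustration: `std::thread` stands in for TBB so that the block has no external dependency, pixels are stored as 16-bit integers so that eight fit in one SSE2 register, and the names `sobel_x_rows`, `sobel_x_parallel`, and `load8` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 integer intrinsics

// Hypothetical helper: unaligned load of eight 16-bit pixels.
static inline __m128i load8(const std::int16_t* p) {
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
}
#endif

// Hypothetical helper: horizontal Sobel response for interior rows [y0, y1)
// of a row-major image that is w pixels wide. Border pixels are not written.
void sobel_x_rows(const std::int16_t* src, std::int16_t* dst,
                  int w, int y0, int y1) {
    for (int y = y0; y < y1; ++y) {
        const std::int16_t* up = src + (y - 1) * w;
        const std::int16_t* mid = src + y * w;
        const std::int16_t* down = src + (y + 1) * w;
        std::int16_t* out = dst + y * w;
        int x = 1;
#if defined(__SSE2__)
        // Data-level parallelism: eight output pixels per iteration.
        // The bound keeps the right-hand load (up to index x + 8) in the row.
        for (; x + 8 <= w - 1; x += 8) {
            __m128i left = _mm_add_epi16(
                _mm_add_epi16(load8(up + x - 1), load8(down + x - 1)),
                _mm_slli_epi16(load8(mid + x - 1), 1));  // middle row weight 2
            __m128i right = _mm_add_epi16(
                _mm_add_epi16(load8(up + x + 1), load8(down + x + 1)),
                _mm_slli_epi16(load8(mid + x + 1), 1));
            _mm_storeu_si128(reinterpret_cast<__m128i*>(out + x),
                             _mm_sub_epi16(right, left));
        }
#endif
        // Scalar tail, and the whole row when SSE2 is not available.
        for (; x < w - 1; ++x) {
            int right = up[x + 1] + 2 * mid[x + 1] + down[x + 1];
            int left = up[x - 1] + 2 * mid[x - 1] + down[x - 1];
            out[x] = static_cast<std::int16_t>(right - left);
        }
    }
}

// Task-level parallelism: one thread per band of interior rows.
// Assumption: std::thread replaces the TBB scheduler used in the paper.
void sobel_x_parallel(const std::vector<std::int16_t>& src,
                      std::vector<std::int16_t>& dst,
                      int w, int h, int nthreads) {
    assert(w >= 3 && h >= 3 && nthreads >= 1);
    assert(src.size() == dst.size());
    const int rows = h - 2;                           // interior rows only
    const int band = (rows + nthreads - 1) / nthreads;  // rows per thread
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        const int y0 = 1 + t * band;
        const int y1 = std::min(h - 1, y0 + band);
        if (y0 >= y1) break;
        pool.emplace_back(sobel_x_rows, src.data(), dst.data(), w, y0, y1);
    }
    for (auto& th : pool) th.join();
}
```

Calling `sobel_x_parallel(src, dst, w, h, 4)` fills the interior of `dst` and leaves the one-pixel border as the caller initialized it. Because each band writes a disjoint set of output rows, the threads need no locking. With TBB, the band loop would typically become a `parallel_for` over a row range, which also adds the work-stealing load balancing that fixed bands lack.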


Keywords: Multi-core; Streaming SIMD extension; Threading building block; Sobel operator; Sub-word parallelism; Task-level parallelism; Multimedia



Funding for this paper was provided by Namseoul University.



Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Cheong Ghil Kim (1)
  • Jeom Goo Kim (1)
  • Do Hyeon Lee (2)

  1. Department of Computer Science, Namseoul University, Cheonan-city, Korea
  2. IT Convergence Technology Research & Education Center, Namseoul University, Cheonan-city, Korea
