Background and Related Work

  • Muhammad Usman Karim Khan
  • Muhammad Shafique
  • Jörg Henkel


This chapter covers the basics of video processing, with a focus on video coding applications. It discusses general video system design together with its memory access patterns and resource utilization. The fundamentals of HEVC and H.264/AVC video encoding are presented, followed by the challenges they pose for designing computationally efficient video processing systems. Modern technological challenges that arise when deploying video systems are also presented. Finally, state-of-the-art techniques for meeting these design challenges are discussed, with details on both the software and hardware layers of video processing systems.


  1. 1.
    Bjontegaard, G. (2001). Calculation of average PSNR differences between RD-curves. VCEG Contribution VCEG-M33.Google Scholar
  2. 2.
    Ostermann, J., Bormans, J., List, P., Marpe, D., Narroschke, M., Pereira, F., Stockhammer, T., & Wedi, T. (2004). Video coding with H.264/AVC: Tools, performance, and complexity. IEEE Circuits and Systems Magazine, 4(1), 7–28.CrossRefGoogle Scholar
  3. 3.
    Sullivan, G. J., Ohm, J., Han, W., & Wiegand, T. (2012). Overview of high efficiency video coding. IEEE Transactions on Circuits and Systems for Video Technology, 22(12), 1649–1668.CrossRefGoogle Scholar
  4. 4.
    Wiegand, T., Sullivan, G., Bjontegaard, G., & Luthra, A. (2003). Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7), 560–576.CrossRefGoogle Scholar
  5. 5.
    Chen, C., Huang, C., Chen, Y., & Chen, L. (2006). Level C+ data reuse scheme for motion estimation with corresponding coding orders. IEEE Transactions on Circuits and Systems for Video Technology, 16(4), 553–558.CrossRefGoogle Scholar
  6. 6.
    Zhu, S., & Ma, K.-K. (2000). A new diamond search algorithm for fast block-matching motion estimation. IEEE Transactions on Image Processing, 9(2), 287–290.CrossRefGoogle Scholar
  7. 7.
    Bossen, F., Bross, B., Suhring, K., & Flynn, D. (2012). HEVC complexity and implementation analysis. IEEE Transactions on Circuits and Systems for Video Technology, 22(12), 1685–1696.CrossRefGoogle Scholar
  8. 8.
    Sze, V., Finchelstein, D. F., Sinangil, M. E., & Chandraksan, A. P. (2009). A 0.7-V 1.8-mW H.264/AVC 720p video decoder. IEEE Journal of Solid-Sate Circuits, 44(11), 2943–2956.CrossRefGoogle Scholar
  9. 9.
    Sampaio, F., Shafique, M., Zatt, B., Bampi, S., & Henkel, J. (2014). Energy-efficient architecture for advanced video memory. In International Conference on Computer-Aided Design.Google Scholar
  10. 10.
    Purnachand, N., Alves, L. N., & Navarro, A. (2012). Improvements to TZ search motion estimation algorithm for multiview video cod-ing. In IEEE International Confernce on Systems, Signals and Image Processing (IWSSIP), pp. 388–391.Google Scholar
  11. 11.
    Shafique, M., Bauer, L., & Henkel, J. (2010). enBudget: A run-time adaptive predictive energy-budgeting scheme for energy-aware motion estimation in H.264/MPEG-4 AVC video encoder. In Design, Automation and Test in Europe.Google Scholar
  12. 12.
    Gurhanli, A., Chen, C.-P., & Hung, S.-H. (2010). GOP-level parallelization of the H.264 decoder without a start-code scanner. In International Conference on Signal Processing Systems (ICSPS).Google Scholar
  13. 13.
    VideoLAN - x264. [Online]. Available: Accessed 5 Oct 2015.
  14. 14.
    Zhao, L., Xu, J., Zhou, Y., & Ai, M. (2012). A dynamic slice control scheme for slice-parallel video encoding. In International Conference on Image Processing.Google Scholar
  15. 15.
    Ba, K., Jin, X., & Goto, S. (2010). A dynamic slice-resize algorithm for fast H.264/AVC parallel encoder. In International Symposium on Intelligent Signal Processing and Communication Systems.Google Scholar
  16. 16.
    Khan, M. U. K., Shafique, M., & Henkel, J. (2014). Software architecture of high efficiency video coding for many-core systems with power-efficient workload balancing. In Design, Automation and Test in Europe.Google Scholar
  17. 17.
    Ahmad, I., & Ghafoor, A. (1991). Semi-distributed load balancing for massively parallel multicomputer systems. IEEE Transactions on Software Engineering, 17(10), 987–1004.MathSciNetCrossRefGoogle Scholar
  18. 18.
    Williams, R. (1991). Performance of dynamic load balancing algorithms for unstructured mesh calculations. Concurrency: Practice and experience, 3(5), 457–481.CrossRefGoogle Scholar
  19. 19.
    Juice Encoder– 4 in 1 MPEG-4 AVC/H.264 HD encoder. Antik Technology, [Online]. Available:
  20. 20.
    Marvell 88DE3100 High-Definition Secure Media Processor System-on-Chip (SoC). [Online]. Available:
  21. 21.
    Distributed Coding for Video Services (DISCOVER). Application scenarios and functionalities for DVC.Google Scholar
  22. 22.
    Wyner, A., & Ziv, J. (1976). The rate-distortion function for source coding with side information at the decoder. IEEE Transaction on Information Theory, 22, 1–10.MathSciNetCrossRefzbMATHGoogle Scholar
  23. 23.
    Girod, B., Aaron, A. M., Rane, S., & Rebollo-Monedero, D. (2005). Distributed Video Coding. Proceedings of the IEEE, 93(1), 71–83.CrossRefzbMATHGoogle Scholar
  24. 24.
    Puri, R., & Ramchandran, K. (2003). PRISM: An uplink-friendly multimedia coding paradigm. In International Conference on Acoustics, Speech, and Signal Processing.Google Scholar
  25. 25.
    Chen, J., Khisti, A., Malioutov, D., & Yedidia, J. (2004). Distributed source coding using serially-concatenated-accumulate codes. In Information Theory Workshop.Google Scholar
  26. 26.
    Tseng, H.-Y., Shen, Y.-C., & Wu, J.-L. (2011). Distributed video coding with compressive measurements. In International conference on Multimedia.Google Scholar
  27. 27.
    Sejdinovic, D., Piechocki, R. J., & Doufexi, A. (2009). Rateless distributed source code design. In Mobile Multimedia Communica-tions Conference.Google Scholar
  28. 28.
    Chien, S.-Y., Cheng, T.-Y., Chiu, C.-C., Tsung, P.-K., Lee, C.-H., Somayazulu, V., & Chen, Y.-K. (2012). Power optimization of wireless video sensor nodes in M2M networks. In Asia and South Pacific Design Automation Conference.Google Scholar
  29. 29.
    Huang, Y.-W., Chen, T.-C., Tsai, C.-H., Chen, C.-Y., Chen, T.-W., Chen, C.-S., Shen, C.-F., Ma, S.-Y., Wang, T.-C., Hsieh, B.-Y., Fang, H.-C., & Chen, L.-G. (2005). A 1.3TOPS H.264/AVC single-chip en-coder for HDTV applications. In International Solid-State Circuits Conference.Google Scholar
  30. 30.
    Chiu, C.-C., Chien, S.-Y., Lee, C.-H., Somayazulu, V., & Chen, Y.-K.. (2011). Distributed video coding: A promising solution for distributed wireless video sensors or not?. In Visual Communications and Image Processing.Google Scholar
  31. 31.
    Shafique, M., Khan, M. U. K., & Henkel, J. (2013). Content-driven adaptive computation offloading for energy-aware hybrid distributed video coding. In International Symposium on Low Power Electronics and Design (ISLPED).Google Scholar
  32. 32.
    Kumar, K., Liu, J., Lu, Y.-H., & Bhargava, B. (2012). A survey of computation offloading for mobile systems. Mobile Networks and Applications, 18(1), 129–140.CrossRefGoogle Scholar
  33. 33.
    Colin, A., Kandhalu, A., & Rajkumar, R. (2015). Energy-efficient allocation of real-time applications onto single-ISA heterogeneous multi-core processors. Journal of Signal Processing Systems, pp. 1–20.Google Scholar
  34. 34.
    Schroder, D. K., & Babcock, J. A. (2003). Negative bias temperature instability: Road to cross in deep submicron silicon semiconductor manufacturing. Journal of Applied Physics, 94(1), 1–18.CrossRefGoogle Scholar
  35. 35.
    Shin, J., Zyuban, V., Bose, P., & Pinkston, T. (2008). A proactive wearout recovery approach for exploiting microarchi-tectural redundancy to extend cache SRAM lifetime. In International Symposium on Computer Architecture (ISCA).Google Scholar
  36. 36.
    Tiwari, A., & Torrellas, J. (2008). Facelift: Hiding and slowing down aging in multicores. In International Symposium on Microarchitecture (MICRO).Google Scholar
  37. 37.
    Vattikonda, R., Wang, W., & Cao, Y. (2006). Modeling and minimization of PMOS NBTI effect for robust nanometer design. In Design Automation Conference (DAC).Google Scholar
  38. 38.
    Velamala, J. B., Sutaria, K., Sato, T., & Cao, Y. (2012). Physics matters: Statistical aging prediction under trapping/detrapping. In Design Automation Conference (DAC).Google Scholar
  39. 39.
    Singh, A., Shafique, M., Kumar, A., & Henkel, J. (2013). Mapping on multi/many-core systems: Survey of current and emerging trends. In Design Automation Conference (DAC).Google Scholar
  40. 40.
    Chi, C. C., Alvarez-Mesa, M., Juurlink, B., Clare, G., Henry, F., Pateux, S., & Schierl, T. (2012). Parallel scalability and efficiency of HEVC parallelization approaches. IEEE Transactions on Circuits and Systems on Video Technology, 22(12), 1827–1838.CrossRefGoogle Scholar
  41. 41.
    Alvanos, M., Tzenakis, G., Nikolopoulos, D. S., & Bilas, A. (2011). Task-based parallel H. 264 video encoding for explicit communication architectures. In International Conference on Embedded Computer Systems.Google Scholar
  42. 42.
    Brun, O., Teuliere, V., & Garcia, J. M. (2002). Parallel particle filtering. Journal of Parallel and Distributed Computing, 62(7), 1186–1202.CrossRefzbMATHGoogle Scholar
  43. 43.
    Rujirakul, K., So-In, C., & Arnonkijpanich, B. (2014). PEM-PCA: A parallel expectation-maximization PCA face recognition architecture. The Scientific World Journal. Google Scholar
  44. 44.
    Jing, X.-Y., Li, S., Zhang, D., Yang, J., & Yang, J.-Y. (2012). Supervised and unsupervised parallel subspace learning for large-scale image recognition. IEEE Transactions on Circuits and Systems for Video Technology, 22(10), 1497–1511.CrossRefGoogle Scholar
  45. 45.
    Dong, C., Zhao, H., & Wang, W. (2010). Parallel nonnegative matrix factorization algorithm on the distributed memory platform. International Journal of Parallel Programming, 38(2), 117–137.CrossRefzbMATHGoogle Scholar
  46. 46.
    Shah, S., & Kothari, R. (2013). Convergence of the dynamic load balancing problem to Nash equilibrium using distributed local interactions. Information Sciences, 221, 297–305.CrossRefGoogle Scholar
  47. 47.
    Drougas, Y., Repantis, T., & Kalogeraki, V. (2006). Load balancing techniques for distributed stream processing applications in overlay environments. In IEEE International Symposium on Object and Component-Oriented Real-Time Distributed Computing.Google Scholar
  48. 48.
    Robertazzi, T. G. (2003). Ten reasons to use divisible load theory. Computer, 36(5), 63–68.CrossRefGoogle Scholar
  49. 49.
    Wang, K., Zhou, X., Li, T., Zhao, D., Lang, M., & Raicu, I. (2014). Optimizing load balancing and data-locality with data-aware scheduling. In International Conference on Big Data.Google Scholar
  50. 50.
    Turakhia, Y., Raghunathan, B., Garg, S., & Marculescu, D. (2013). HaDeS: Architectural synthesis for heterogeneous dark silicon chip multi-processors. In Design Automation Conference (DAC).Google Scholar
  51. 51.
    Buss, M., Givargis, T., & Dutt, N. (2003). Exploring efficient operating points for voltage scaled embedded processor cores. In Real-Time Systems Symposium (RTSS).Google Scholar
  52. 52.
    Rosas, C. Morajko, A. Jorba, J., & Cesar, E. (2011). Workload balancing methodology for data-intensive applications with divisible load. In Symposium on Computer Architecture and High Performance Computing.Google Scholar
  53. 53.
    Matthur, A., & Mundur, P. (2003). Dynamic load balancing across mirrored multimedia servers. In International Conference on Multimedia and Expo.Google Scholar
  54. 54.
    Bowman, K., Duvall, S., & Meindl, J. (2002). Impact of die-to-die and within-die parameter fluctuations on the maximum clock frequency distribution for gigascale integration. IEEE Journal of Solid-State Circuits, 37(2), 183–190.CrossRefGoogle Scholar
  55. 55.
    Kim, J., Yoo, S., & Kyung, C.-M. (2011). Program phase-aware dynamic voltage scaling under variable computational workload and memory stall environment. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 30(1), 110–123.CrossRefGoogle Scholar
  56. 56.
    Devadas, V., & Aydin, H. (2010). DFR-EDF: A unified energy management framework for real-time systems. In Real-Time and Embedded Technology and Applications Symposium (RTAS).Google Scholar
  57. 57.
    Ma, K., Li, X., Chen, M., & Wang, X. (2011). Scalable power control for many-core architectures running multi-threaded applications. In Internation Symposium on Computer Architecture.Google Scholar
  58. 58.
    Dehyadegari, M., Marongiu, A., Kakoee, M., Mohammadi, S., Yazdani, N., & Benini, L. (2015). Architecture support for tightly-coupled multi-core clusters with shared-memory HW accelerators. IEEE Transactions on Computer, 64(8), 2132–2144.MathSciNetCrossRefzbMATHGoogle Scholar
  59. 59.
    Sarma, S., Muck, T., Bathen, L., Dutt, N., & Nicolau, A. (2015). SmartBalance: A sensing-driven linux load balancer for energy efficiency of heterogeneous MPSoCs. In Design Automation Conference (DAC).Google Scholar
  60. 60.
    ARM big.LITTLE Architecture. ARM, [Online]. Available: Accessed 07 Aug 2015.
  61. 61.
    Momcilovic, S., Ilic, A., Roma, N., & Sousa, L. (2014). Dynamic load balancing for real-time video encoding on heterogeneous CPU+GPU systems. IEEE Transactions on Multimedia, 16(1), 108–121.CrossRefGoogle Scholar
  62. 62.
    Xiao, W., Li, B., Xu, J., Shi, G., & Wu, F. (2015). HEVC encoding optimization using multi-core CPUs and GPUs. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 9(99), 1–14.Google Scholar
  63. 63.
    Momcilovic, S., Roma, N., & Sousa, L. (2013). Exploiting task and data parallelism for advanced video coding on hybrid CPU + GPU platforms. Journal of Real-Time Image Processing, pp. 1–17.Google Scholar
  64. 64.
    Jian, C., & John, L. (2009). Efficient program scheduling for heterogeneous multi-core processors. In Design Automation Conference (DAC).Google Scholar
  65. 65.
    Mühlbauer, T., Rödiger, W., Seilbeck, R., Kemper, A., & Neumann, T. (2014). Heterogeneity-conscious parallel query execution: Getting a better mileage while driving faster!. In International Workshop on Data Management on New Hardware.Google Scholar
  66. 66.
    Shafique, M., Molkenthin, B., & Henkel, J. (2010). An HVS-based adaptive computational complexity reduction scheme for H.264/AVC video encoder using prognostic early mode exclusion. In Design, Automation and Test in Europe Conference (DATE).Google Scholar
  67. 67.
    Bariani, M., Lambruschini, P., & Raggio, M. (2012). An efficient multi-core SIMD implementation for H.264/AVC encoder. In VLSI Design.Google Scholar
  68. 68.
    Rodríguez, A., González, A., & Malumbres, M. P. (2006). Hierarchical parallelization of an H.264/AVC video encoder. In International Symposium on Parallel Computing in Electrical Engineering.Google Scholar
  69. 69.
    Gong, P., Basciftci, Y., & Ozguner, F. (2012). A parallel resampling algorithm for particle filtering on shared-memory architectures. In Parallel and Distributed Processing Symposium Workshops.Google Scholar
  70. 70.
    Henkel, J., & Yanbing, L. (1998). Energy-conscious HW/SW-partitioning of embedded systems: a case study on an MPEG-2 encoder. In International Workshop on Hardware/Software Codesign.Google Scholar
  71. 71.
    Cuomo, S., Michele, P. D., & Piccialli, F. (2014). 3D data denoising via nonlocal means filter by using parallel GPU strategies. In Computational and Mathematical Methods in Medicine.Google Scholar
  72. 72.
    Moustafa, M., Ebied, H. M., Helmy, A., Nazamy, T. M., & Tolba, M. F. (2014). Satellite super resolution image reconstruction based on parallel support vector regression. In Advanced Machine Learning Technologies and Applications, Springer, pp. 223–235.Google Scholar
  73. 73.
    Garcia Freitas, P., Farias, M. and De Araujo, A. (2014). A parallel framework for video super-resolution. In Graphics, Patterns and Images (SIBGRAPI).Google Scholar
  74. 74.
    Jung, B., & Jeon, B. (2008). Adaptive slice-level parallelism for H.264/AVC encoding using pre macroblock mode selection. Journal of Visual Communication Image Representation, 19(8), 558–572.CrossRefGoogle Scholar
  75. 75.
    Jiang, W., Mal, H., & Chen, Y. (2012). Gradient based fast mode decision algorithm for intra prediction in HEVC. In International Conference on Consumer Electronics, Communications and Networks.Google Scholar
  76. 76.
    Cassa, M. B., Naccari, M., & Pereira, F. (2012). Fast rate distortion optimization for the emerging HEVC standard. In Picture Coding Symposium.Google Scholar
  77. 77.
    Zhang, H., & Ma, Z. (2012). Fast intra prediction for high efficiency video coding. In Advances in Multimedia Information Processing.Google Scholar
  78. 78.
    Sun, H., Zhou, D., & Goto, S. (2012). A low-complexity HEVC Intra prediction algorithm based on level and mode filtering,. In International Conference on Multimedia and Expo (ICME).Google Scholar
  79. 79.
    Pan, F., Lin, X., Rahardja, S., Lim, K. P., Li, Z. G., Wu, D., & Wu, S. (2005). Fast mode decision algorithm for intraprediction in H.264/AVC video coding. IEEE Transactions on Circuits and Systems for Video Technology, 15(7), 813–822.CrossRefGoogle Scholar
  80. 80.
    Tsai, A. C., Paul, A., Wang, J. C., & Wang, J. F. (2008). Intensity gradient technique for efficient intra-prediction in H.264/AVC. IEEE Transactions on Circuits and Systems on Video Technology, 18(5), 694–698.CrossRefGoogle Scholar
  81. 81.
    Fonseca, T. A., Liu, Y., & Queiroz, R. L. D. (2007). Open-loop prediction in H.264 / AVC for high definition sequences. In SBrT.Google Scholar
  82. 82.
    Tian, G., & Goto, S. (2012). Content adaptive prediction unit size decision algorithm for HEVC intra coding. In Picture Coding Symposium.Google Scholar
  83. 83.
    Zhao, L., Zhang, L., Ma, S., & Zhao, D. (2011). Fast mode decision algorithm for Intra prediction in HEVC. In Visual Communications and Image Processing (VCIP).Google Scholar
  84. 84.
    Silva, T. D., Agostini, L. V., & Cruz, L. A. D. S. C. (2012). Fast HEVC intra prediction mode decision based on EDGE direction information. In European Signal Processing Conference (Eusipco).Google Scholar
  85. 85.
    Haan, G. D., & Biezen, P. (1998). An efficient true-motion estimator using candidate vectors from a parametric motion model. IEEE Transactions on Circuits and Systems for Video Technology, 8(9), 86–91.Google Scholar
  86. 86.
    Shim, H., & Kyung, C.-M. (2009). Selective search area reuse algorithm for low external memory access motion estimation. IEEE Transactions on Circuits and Systems for Video Technology, 19(7), 1044–1050.CrossRefGoogle Scholar
  87. 87.
    Kun, Z., Chun, Y., Qiang, L., & Yuzhuo, Z. (2007). A fast block type decision method for H.264/AVC intra prediction. In International Conference on Advanced Communication Technology.Google Scholar
  88. 88.
    Lin, Y.-K., & Chang, T. (2005). Fast block type decision algorithm for intra prediction in H.264/FRex. In Internatianal conference on Image Processing (ICIP).Google Scholar
  89. 89.
    Kivanc Mihcak, M., Kozintsev, I., Ramchandran, K., & Moulin, P. (1999). Low-complexity image denoising based on statistical modeling of wavelet coefficients. IEEE Signal Processing Letters, 6(12), 300–303.CrossRefGoogle Scholar
  90. 90.
    Khan, M. U. K., Bais, A., Khawaja, M., Hassan, G. M., & Arshad, R. (2009). A swift and memory efficient hough transform for systems with limited fast memory. In International Conference on Image Analysis and Recognition (ICIAR).Google Scholar
  91. 91.
    OpenCV. [Online]. Available: Accessed 08 Aug 2015.
  92. 92.
    OpenVX. [Online]. Available: Accessed 08 Aug 2015.
  93. 93.
    Muthukaruppan, T. S., Pricopi, M., Venkataramani, V., Mitra, T., & Vishin, S. (2013). Hierarchical power management for asymmetric multi-core in dark silicon era. In Design Automation Conference (DAC).Google Scholar
  94. 94.
    Khdr, H., Ebi, T., Shafique, M., Amrouch, H., & Henkel, J. (2014). mDTM: Multi-objective dynamic thermal management for on-chip systems. In Design, Automation and Test in Europe Conference and Exhibition (DATE).Google Scholar
  95. 95.
    Devadas, V., & Aydin, H. (2012). On the interplay of voltage/frequency scaling and device power management for frame-based real-time embedded applications. IEEE Transactions on Computers, 61(1), 31–44.MathSciNetCrossRefzbMATHGoogle Scholar
  96. 96.
    Isci, C., Buyuktosunoglu, A., Cher, C., Bose, P., & Martonosi, M. (2006). An analysis of efficient multi-core global power management policies: Maximizing performance for a given power budget. In Microarchitecture.Google Scholar
  97. 97.
    Pagani, S., Khdr, H., Munawar, W., Chen, J.-J., Shafique, M., Li, M., & Henkel, J. (2014). TSP: Thermal safe power: Efficient power budgeting for many-core systems in dark silicon. In International Conference on Hardware/Software Codesign and System Synthesis.Google Scholar
  98. 98.
    Mishra, A., Srikantaiah, S., Kandemir, M., & Das, C. R. (2010). CPM in CMPs: Coordinated power management in chip-multiprocessors. In International Conference on High Performance Computing, Networking, Storage and Analysis.Google Scholar
  99. 99.
    Winter, J. A., Albonesi, D. H., & Shoemaker, C. A. (2010). Scalable thread scheduling and global power management for heterogeneous many-core architectures. In Parallel Architectures and Compilation.Google Scholar
  100. 100.
    Sharifi, A., Mishra, A., Srikantaiah, S., Kandemir, M., & Das, C. R. (2012). PEPON: Performance-aware hierarchical power budgeting for NoC based multicores. In Parallel Architectures and Compilation Techniques.Google Scholar
  101. 101.
    Shafique, M., Garg, S., Henkel, J., & Marculescu, D. (2014). The EDA challenges in the dark silicon era. In Design Automation Conference.Google Scholar
  102. 102.
    Huang, Y.-W., Hsieh, B.-Y., Chen, T.-C., & Chen, L.-G. (2005). Analysis, fast algorithm, and VLSI architec-ture design for H.264/AVC intra frame coder. IEEE Transactions on Circuits and Systems for Video Technology, 15(3), 378–401.CrossRefGoogle Scholar
  103. 103.
    Wang, J.-C., Wang, J.-F., Yang, J.-F., & Chen, J.-T. (2007). A fast mode decision algorithm and its VLSI design for H.264/AVC intra-prediction. IEEE Transactions on Circuits and Systems for Video Technology, 17(10), 1414–1422.CrossRefGoogle Scholar
  104. 104.
    Roszkowski, M., & Pastuszak, G. (2010). Intra prediction hardware module for high-profile H.264/AVC encoder. In Signal Processing Algorithms, Architectures, Arrangements and Applications Conference.Google Scholar
  105. 105.
    He, G., Zhou, D., Zhou, J., & Goto, S. (2010). Intra prediction architecture for H.264/AVC QFHD encoder. In Picture Coding Symposium.Google Scholar
  106. 106.
    Diniz, C., Zatt, B., Thiele, C., Susin, A., Bampi, S., Sampaio, F., Palomino, D., & Agostini, L. (2011). A high throughput H.264/AVC intra-frame encod-ing loop architecture for HD1080p. In International Symposium on Circuits and Systems.Google Scholar
  107. 107.
    Li, F., Shi, G., & Wu, F. (2011). An efficient VLSI architecture for 4×4 intra prediction in the High Efficiency Video Coding (HEVC) standard. In International Conference on Image Processing.Google Scholar
  108. 108.
    Cervero, T., Otero, A., López, S., de la Torre, E., Callicó, G., Riesgo, T., & Sarmiento, R. (2013). A scalable H.264/AVC deblocking filter architecture. Journal of Real-Time Image Processing, pp. 1–25.Google Scholar
  109. 109.
    Mangard, S., Aigner, M., & Dominikus, S. (2003). A highly regular and scalable AES hardware architecture. IEEE Transactions on Computers, 52(4), 483–491.CrossRefGoogle Scholar
  110. 110.
    Varatkar, G. V., & Shanbhag, N. R. (2008). Error-resilient motion estimation architecture. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 16(10), 1399–1412.CrossRefGoogle Scholar
  111. 111.
    Saponara, S., & Fanucci, L. (2004). Data-adaptive motion estimation algorithm and VLSI architecture design for low-power video systems. IEE Proceedings-Computers and Digital Techniques, 151(1), 51–59.CrossRefGoogle Scholar
  112. 112.
    Tsai, C.-Y., Chung, C., Chen, Y.-H., Chen, T.-C., & Chen, L.-G. (2007). Low power cache algorithm and architecture design for fast motion estimation in H. 264/AVC encoder system. In In-ternational Conference on Acoustics, Speech and Signal Processing.Google Scholar
  113. 113.
    Kim, N. S., Austin, T., Blaauw, D., Mudge, T., Flautner, K., Hu, J. S., Irwin, M. J., Kandemir, M., & Narayanan, V. (2003). Leakage current: Moore’s law meets static pow-e. Computers, 36(12), 68–75.CrossRefGoogle Scholar
  114. 114.
    Ma, Z., & Segall, A. (2011). Frame buffer compression for low-power video coding. In International Conference on Image Processing.Google Scholar
  115. 115.
    Hsu, M.-Y. (2000). Scalable module-based architecture for MPEG-4 BMA motion estimation.Google Scholar
  116. 116.
    Tuan, J.-C., Chang, T.-S., & Jen, C.-W. (2002). On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture. IEEE Transactions on Circuits and Systems for Video Technology, 12(1), 61–72.CrossRefGoogle Scholar
  117. 117.
    Wu, X., Li, J., Zhang, L., Speight, E., Rajamony, R., & Xie, Y. (2009). Hybrid cache architecture with disparate memory technologies. In International Symposium on Computer Architecture (ISCA).Google Scholar
  118. 118.
    Diao, Z., Li, Z., Wang, S., Ding, Y., Panchula, A., Chen, E., Wang, L.-C., & Huai, Y. (2007). Spin-transfer torque switching in magnetic tunnel junctions and spin-transfer torque random access memory. Journal of Physics: Condensed Matter, 19(16), 1–13.Google Scholar
  119. 119.
    Dong, X., Wu, X., Sun, G., Xie, Y., Li, H., & Chen, Y. (2008). Circuit and microarchitecture evaluation of 3D stacking magnetic RAM (MRAM) as a universal memory replacement. In Design Automation Conference (DAC).Google Scholar
  120. 120.
    Qureshi, M. K., Srinivasan, V., & Rivers, J. A. (2009). Scalable high perfor-mance main memory system using phase-change memory technology. In International Symposium on Computer Architec-ture (ISCA).Google Scholar
  121. 121.
    Hanzawa, S., Kitai, N., Osada, K., Kotabe, A., Matsui, Y., Matsuzaki, N., Takaura, N., Moniwa, M., & Kawahara, T. (2007). A 512KB Embed-ded phase change memory with 416kB/s write throughput at 100uA cell write current. In International Solid-State Circuits Conference (ISSCC).Google Scholar
  122. 122.
    Yang, S., & Ryu, Y. (2012). A memory management scheme for hybrid memory architecture in mission critical computers. In International Conference on Software Technology.Google Scholar
  123. 123.
    Dhiman, G., Ayoub, R., & Rosing, T. (2009). PDRAM: A hybrid PRAM and DRAM main memory system. In Design Automation Conference.Google Scholar
  124. 124.
    Bathen, L., & Dutt, N. (2012). HaVOC: A hybrid memory-aware virtualization layer for on-chip distributed scratchpad and non-volatile memories. In Design Automation Conference.Google Scholar
  125. 125.
    Stancu, L. C., Bathen, L. A. D., Dutt, N., & Nicolau, A. (2012). AVid : Annotation driven video decoding for hybrid memories. In Embedded Systems for Real-Time Multimedia.Google Scholar
  126. 126.
    Desikan, R., Lefurgy, C., Keckler, S., & Burger, D. (2002). On-chip MRAM as a high-bandwidth, low-latency replacement for DRAM physical memories. University of Texas at Austin.Google Scholar
  127. 127.
    Nomura, K., Abe, K., Yoda, H., & Fujita, S. (2012). Ultra low power processor using perpendicular-STT-MRAM/SRAM based hy-brid cache toward next generation normally-off computers. Journal of Applied Physics, 111(7).Google Scholar
  128. 128.
    Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., & Hanrahan, P. (2008). Larrabee: A Many-Core x86 Architecture for Visual Computing. ACM Transactions on Graphics, 27(3).Google Scholar
  129. 129.
    Mattina, M. (2014). Architecture and Performance of the Tile-GX Processor Family, White Paper.Google Scholar
  130. 130.
    Shafique, M., Bauer, L., & Henkel, J. (2010). Optimizing the H.264/AVC video encoder application structure for reconfigurable and application-specific platforms. Journal of Signal Processing Systems (JSPS), 60(2), 183–210.CrossRefGoogle Scholar
  131. 131.
    Liu, C., Granados, O., Duarte, R., & Andrian, J. (2012). Energy efficient architecture using hardware acceleration for software defined radio components. Journal of Information Processing Systems, 8(1), 133–144.CrossRefGoogle Scholar
  132. 132.
    Nios II Custom Instruction User Guide. Altera, (2011).Google Scholar
  133. 133.
    Khan, M. U. K., Shafique, M., & Henkel, J. (2013). Hardware-software collaborative complexity reduction scheme for the emerging HEVC intra encoder. In Design, Automation and Test in Europe (DATE).Google Scholar
  134. 134.
    Shojania, H., & Baochun, L. (2007). Parallelized progressive network coding with hardware acceleration. In International Workshop on Quality of Service.Google Scholar
  135. 135.
    Doan, H. C., Javaid, H., & Parameswaran, S. (2014). Flexible and scalable implementation of H.264/AVC encoder for multiple resolutions using ASIPs. In Design, Automation and Test in Europe Conference and Exhibition (DATE).Google Scholar
  136. 136.
    Kim, S. D., Lee, J. H., Hyun, C. J., & Sunwoo, M. H. (2006). ASIP approach for implementation of H.264/AVC. In Asia and South Pacific Conference on Design Automation (ASP-DAC).Google Scholar
  137. 137.
    Swanson, S., & Taylor, M. B. (2011). GreenDroid: Exploring the next evolution in smartphone application processors. IEEE Communications Magazine, 49(4), 112–119.CrossRefGoogle Scholar
  138. 138.
    Teich, J., Henkel, J., Herkersdorf, A., Schmitt-Landsiedel, D., Schröder-Preikschat, W., & Snelting, G. (2010). Invasive computing: An overview. In Multiprocessor System-on-Chip, Springer, pp. 241–268.Google Scholar
  139. 139.
    Sheldon, D., & Forin, A. (2010). An online scheduler for hardware accelerators on general purpose operating systems. Microsoft Research.Google Scholar
  140. 140.
    Huang, C., Sheldon, D., & Vahid, F. (2008). Dynamic tuning of configurable architectures: The AWW online algorithm. In International Conference on Hardware/Software Codesign and System Synthesis.Google Scholar
  141. 141.
    Majumder, T., Pande, P. P., & Kalyanaraman, A. (2013). High-throughput, energy-efficient network-on-chip-based hardware accelerators. Journal of Sustainable Computing: Informatics and Systems, 3(1), 36–46.Google Scholar
  142. 142.
    Cong, J., Ghodrat, M. A., Gill, M., Grigorian, B., & Reinman, G. (2012). Architecture support for accelerator-rich CMPs. In Design Automation Conference.Google Scholar
  143. 143.
    Cota, E., Mantovani, P., Petracca, M., Casu, M., & Carloni, L. (2012). Accelerator memory reuse in the dark silicon era. Computer Architecture Letters, pp. 1–4.Google Scholar
  144. 144.
    Clemente, J. A., Beretta, I. V., Rana, V., Atienza, D., & Sciuto, D. (2014). A mapping-scheduling algorithm for hardware acceleration on reconfigurable platform. Transactions on Reconfigurable Technology and Systems, 7(2).Google Scholar
  145. 145.
    Paul, S., Karam, R., Bhunia, S., & Puri, R. (2014). Energy-efficient hardware acceleration through computing in the memory. In Design, Automation and Test in Europe (DATE).Google Scholar
  146. 146.
    Kothawade, S., Chakraborty, K., & Roy, S. (2011). Analysis and mitigation of NBTI aging in register file: An end-to-end approach. In International Symposium on Quality Electronic Design (ISQED).Google Scholar
  147. 147.
    Amrouch, H., Ebi, T., & Henkel, J. (2013). Stress balancing to mitigate NBTI Effects in register files. In Dependable Systems and Networks (DSN).Google Scholar
  148. 148.
    Siddiqua, T., & Gurumurthi, S. (2010). Recovery boosting: A technique to enhance NBTI recovery in SRAM arrays. In Annual Symposium on VLSI.Google Scholar
  149. 149.
    Sil, A., Ghosh, S., Gogineni, N., & Bayoumi, M. (2008). A novel high write speed, low power, read-SNM-Free 6T SRAM cell. In Midwest Symposium on Circuits and Systems.Google Scholar
  150. 150.
    Abella, J., Vera, X., Unsal, O., & Gonzalez, A. (2008). NBTI-resilient memory cells with NAND gates. US Patent US20080084732 A1.Google Scholar
  151. 151.
    Wang, S., Jin, T., Zheng, C., & Duan, G. (2012). Low power aging-aware register file design by duty cycle balancing. In Design, Automation and Test in Europe (DATE).Google Scholar
  152. 152.
    Wang, S., Duan, G., Zheng, C., & Jin, T. (2013). Combating NBTI-induced aging in data caches. In Great lakes symposium on VLSI.Google Scholar
  153. 153.
    Gunadi, E., Sinkar, A. A., Kim, N. S., & Lipasti, M. H. (2010). Combating aging with the colt duty cycle equalizer. In International Symposium on Microarchitecture.Google Scholar
  154. 154.
    Calimera, A., Loghi, M., Macii, E., & Poncino, M. (2011). Partitioned cache architectures for reduced NBTI-induced aging. In Design, Automation and Test in Europe (DATE).Google Scholar
  155. 155.
    Henkel, J., Bukhari, H., Garg, S., Khan, M. U. K., Khdr, H., Kriebel, F., Ogras, U., Parameswaran, S., & Shafique, M. (2015). Dark silicon – From computation to communication. In International Symposium on Networks-on-Chip (NOCs).Google Scholar
  156. 156.
    Esmaeilzadeh, H., Sampson, A., Ceze, L., & Burger, D. (2012). Neural acceleration for general-purpose approximate programs. In International Symposium on Microarchitecture.Google Scholar
  157. 157.
    Mahajan, D., Yazdanbakhsh, A., Park, J., Thwaites, B., & Esmaeilzadeh, H. (2015). Prediction-based quality control for approximate accelerators. In Workshop on Approximate Computing Across the System Stack.Google Scholar
  158. 158.
    Allred, J., Roy, S., & Chakraborty, K. (2012). Designing for dark silicon: A methodological perspective on energy efficient systems. In International Symposium on Low Power Electronics and Design (ISLPED).Google Scholar
  159. 159.
    Venkatesh, G., Sampson, J., Goulding, N., Garcia, S., Bryksin, V., Lugo-Martinez, J., Swanson, S., & Taylor, M. B. (2010). Conservation cores: Reducing the energy of mature computations. In Architectural Support for Programming Languages and Operating Systems.Google Scholar
  160. 160.
    Swaminathan, K., Kultursay, E., Saripalli, V., Narayanan, V., Kandemir, M., & Datta, S. (2013). Steep-slope devices: From dark to dim silicon. IEEE Micro, 33(5), 50–59.CrossRefGoogle Scholar
  161. 161.
    Bokhari, H., Javaid, H., Shafique, M., Henkel, J., & Parameswaran, S. (2014). darkNoC: Designing energy-efficient network-on-chip with multi-Vt cells for dark silicon. In Design Automation Conference (DAC).Google Scholar
  162. 162.
    Raghunathan, B., Turakhia, Y., Garg, S., & Marculescu, D. (2013). Cherry-picking: Exploiting process variations in dark-silicon homogeneous chip multi-processors. In Design, Automation & Test in Europe Conference & Exhibition (DATE).Google Scholar
  163. 163.
    Shafique, M., Gnad, D., Garg, S., & Henkel, J. (2015). Variability-aware dark silicon management in on-chip many-core systems. In Design, Automation and Test in Europe Conference and Exhibition.Google Scholar
  164. 164.
    Huang, W., Rajamani, K., Stan, M., & Skadron, K. (2011). Scaling with design constraints: Predicting the future of big chips. IEEE Micro, 31(4), 16–29.CrossRefGoogle Scholar

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Muhammad Usman Karim Khan (1)
  • Muhammad Shafique (2)
  • Jörg Henkel (3)
  1. IBM Deutschland Research & Development GmbH, Böblingen, Germany
  2. Institute of Computer Engineering, Vienna University of Technology, Vienna, Austria
  3. Department of Computer Science, Karlsruhe Institute of Technology, Karlsruhe, Germany
