High-performance code optimizations for mobile devices

  • Sergio Afonso
  • Alejandro Acosta
  • Francisco Almeida


Mobile devices have seen their performance increase in recent years thanks to improvements in System on Chip technologies. These shared-memory systems now integrate multicore CPUs and accelerators, and obtaining optimal performance from such heterogeneous architectures requires using the accelerators efficiently. Graphics Processing Units (GPUs) are accelerators that often outperform multicore CPUs on data-parallel workloads by orders of magnitude, so using them for image processing applications on mobile devices is very important. In this work we explore tiling code optimizations for GPU applications running on mobile devices. We propose a dynamic adaptive tile size selection methodology that finds close-to-optimal parameterizations at runtime, independently of the underlying architecture. Results demonstrate the performance benefits of these optimizations over a set of stencil-based image processing benchmarks.
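The abstract does not spell out how tile sizes are chosen, so the following is only a minimal sketch of the general idea of runtime tile size selection, not the authors' methodology. It assumes a plain CPU box blur standing in for a stencil kernel and a naive exhaustive search that times each candidate once; the names `blur_tiled` and `select_tile_size` are hypothetical and do not come from the paper.

```python
import time


def blur_tiled(src, width, height, tile_w, tile_h):
    """3x3 box blur over a row-major list, traversed tile by tile.

    Assumption: a CPU stand-in for a stencil kernel; borders average
    only the neighbours that fall inside the image.
    """
    dst = [0.0] * (width * height)
    for ty in range(0, height, tile_h):          # tile rows
        for tx in range(0, width, tile_w):       # tile columns
            for y in range(ty, min(ty + tile_h, height)):
                for x in range(tx, min(tx + tile_w, width)):
                    total, count = 0.0, 0
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if 0 <= ny < height and 0 <= nx < width:
                                total += src[ny * width + nx]
                                count += 1
                    dst[y * width + x] = total / count
    return dst


def select_tile_size(run, candidates, timer=time.perf_counter):
    """Time run(tile_w, tile_h) once per candidate and return the fastest.

    Assumption: exhaustive search with a single timing per candidate;
    an adaptive scheme would typically prune or refine the candidates.
    """
    best, best_time = None, float("inf")
    for tile in candidates:
        start = timer()
        run(*tile)
        elapsed = timer() - start
        if elapsed < best_time:
            best, best_time = tile, elapsed
    return best


if __name__ == "__main__":
    w, h = 64, 48
    image = [float((x * y) % 11) for y in range(h) for x in range(w)]
    tile = select_tile_size(
        lambda tw, th: blur_tiled(image, w, h, tw, th),
        [(4, 4), (8, 8), (16, 4), (64, 1)],
    )
    print("selected tile:", tile)
```

The tile size changes only the traversal order, never the result, which is what makes it safe to tune at runtime. On a GPU the equivalent parameter would typically be the work-group size, and the timing would wrap the kernel launch rather than a Python loop.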


Auto-tuning, GPGPU, OpenCL, Android, Heterogeneous architecture



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Engineering and Systems, Escuela Superior de Ingeniería y Tecnología, Universidad de La Laguna, Santa Cruz de Tenerife, Spain
