Study of parallel processing area extraction and data transfer number reduction for automatic GPU offloading of IoT applications


To overcome of the high cost of developing IoT (Internet of Things) services by vertically integrating devices and services, Open IoT has been developed to enable various IoT services to be developed by integrating horizontally separated devices and services. For Open IoT, we have proposed Tacit Computing technology to discover the devices that can provide the data users need on demand and use them dynamically. We have also proposed an automatic GPU (graphics processing unit) offloading method as an elementary technology of Tacit Computing. However, our GPU offloading method can improve only a limited number of applications because it only optimizes the extraction of parallelizable loop statements. Therefore, in this paper, to improve performances of more applications automatically, we propose an improved GPU offloading method with fewer data transfers between the CPU and GPU that can improve performance of many IoT applications. We evaluate our proposed GPU offloading method by applying it to Darknet and Fourier Transform, which are general large applications for CPU, and find that it can process them 3 times and 5 times as quickly as only using CPUs within 10-hour tuning time.

This is a preview of subscription content, access via your institution.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7


  1. Beylkin, G., Fann, G., Harrison, R.J., Kurcz, C., Monzon, L. (2012). Multiresolution representation of operators with boundary conditions on simple domains. Elsevier Applied and Computational Harmonic Analysis, 33(1), 109–139.

    MathSciNet  Article  Google Scholar 

  2. Clang Website. (2018). Accessed 20 May 2019.

  3. Hermann, M., Pentek, T., Otto, B. (2015). Design principles for Industrie 4.0 scenarios, Working Draft, Rechnische Universitat Dortmund.

  4. Holland, J.H. (1992). Genetic algorithms. Scientific american, 267(1), 66–73.

    Article  Google Scholar 

  5. Ishizaki, K. (2016). Transparent GPU exploitation for Java. In The fourth international symposium on computing and networking (CANDAR 2016).

  6. Laplace Equation Source Website. (2018). Accessed 20 May 2019.

  7. NAS.FT Website. (2018). Accessed 20 May 2019.

  8. Putnam, A., Caulfield, A.M., Chung, E.S., Chiou, D., Constantinides, K., Demme, J., Esmaeilzadeh, H., Fowers, J., Gopal, G.P., Gray, J., Haselman, M., Hauck, S., Heil, S., Hormati, A., Kim, J.-Y., Lanka, S., Larus, J., Peterson, E., Pope, S., Smith, A., Thong, J., Xiao, P.Y., Burger, D. (2014). A reconfigurable fabric for accelerating large-scale datacenter services. In Proceedings of the 41th annual international symposium on computer architecture (ISCA’14) (pp. 13–24).

  9. Redmon, J., & Angelova, A. (2015). Real-time grasp detection using convolutional neural networks. In IEEE international conference on robotics and automation (ICRA) (p. 2015).

  10. Sanders, J., & Kandrot, E. (2011). CUDA by example: an introduction to general-purpose GPU programming, Addison-Wesley ISBN-0131387685.

  11. Shirahata, K., Sato, H., Matsuoka, S. (2010). Hybrid map task scheduling for GPU-based heterogeneous clusters. In IEEE second international conference on cloud computing technology and science (CloudCom) (pp. 733–740).

  12. Shitara, A., Nakahama, T., Yamada, M., Kamata, T., Nishikawa, Y., Yoshimi, M., Amano, H. (2011). Vegeta: an implementation and evaluation of development-support middleware on multiple opencl platform. In IEEE second international conference on networking and computing (ICNC 2011) (pp. 141–147).

  13. Stone, J.E., Gohara, D., Shi, G. (2010). OpenCL: a parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering, 12 (3), 66–73.

    Article  Google Scholar 

  14. Su, E., Tian, X., Girkar, M., Haab, G., Shah, S., Petersen, P. (2002). Compiler support of the workqueuing execution model for Intel SMP architectures. In Fourth European workshop on OpenMP.

  15. Sunaga, H., Yamato, Y., Ohnishi, H., Kaneko, M., Iio, M., Hirano, M. (2008). Service delivery platform architecture for the next-generation network, ICIN 2008, Session 9-A.

  16. Tanaka, Y., Miki, M., Yoshimi, M., Hiroyasu, T. (2011). Evaluation of optimization method for fortran codes with GPU automatic parallelization compiler. IPSJ SIG Technical Report, 2011(9), 1–6.

    Google Scholar 

  17. Tomatsu, Y., Hiroyasu, T., Yoshimi, M., Miki, M. (2010). Gpot: intelligent compiler for GPGPU using combinatorial optimization techniques. In The 7th joint symposium between Doshisha University and Chonnam National University.

  18. Tron Project Web Site. (2018). Accessed 20 May 2019.

  19. Wienke, S., Springer, P., Terboven, C., an Mey, D. (2012). Open ACC-first experiences with real-world applications. Euro-Par 2012 Parallel Processing, pp. 859–870.

  20. Wolfe, M. (2010). Implementing the PGI accelerator model. In ACM the 3rd workshop on general-purpose computation on graphics processing units (pp. 43–50).

  21. Wuhib, F., Stadler, R., Lindgren, H. (2012). Dynamic resource allocation with management objectives - implementation for an OpenStack cloud. In 2012 8th international conference and 2012 workshop on systems virtualiztion management, Proceedings of Network and service management (pp. 309–315).

  22. Yamato, Y. (2007). Ubiquitous service composition technology for ubiquitous network environments. IPSJ Journal, 48(2), 562–577.

    Google Scholar 

  23. Yamato, Y. (2015a). Use case study of HDD-SSD hybrid storage, distributed storage and HDD storage on OpenStack. In 19th international database engineering & applications symposium (IDEAS15) (pp. 228–229).

  24. Yamato, Y. (2015b). OpenStack Hypervisor, container and baremetal servers performance comparison. IEICE Communication Express, 4(7), 228–232.

    Article  Google Scholar 

  25. Yamato, Y. (2015c). Automatic verification technology of software patches for user virtual environments on IaaS cloud, Journal of Cloud Computing, Springer, 2015, 4:4,

  26. Yamato, Y. (2016a). Cloud storage application area of HDD-SSD hybrid storage, distributed storage and HDD storage. IEEJ Transactions on Electrical and Electronic Engineering, 11(5), 674–675.

    Article  Google Scholar 

  27. Yamato, Y. (2016b). Performance-aware server architecture recommendation and automatic performance verification technology on IaaS cloud, Service oriented computing and applications, Springer.

  28. Yamato, Y. (2017a). Server selection, configuration and reconfiguration technology for IaaS cloud with multiple server types, Journal of Network and Systems Management, Springer,

  29. Yamato, Y. (2017b). Optimum application deployment technology for heterogeneous IaaS cloud. Journal of Information Processing, 25(1), 56–58.

    Article  Google Scholar 

  30. Yamato, Y., & Sunaga, H. (2007). Context-aware service composition and component change-over using semantic web techniques. In IEEE international conference on web services (ICWS 2007) (pp. 687–694).

  31. Yamato, Y., Tanaka, Y., Sunaga, H. (2006). Context-aware ubiquitous service composition technology. In The IFIP international conference on research and practical issues of enterprise information systems (CONFENIS 2006) (pp. 51–61).

  32. Yamato, Y., Ohnishi, H., Sunaga, H. (2008). Development of service control server for web-telecom coordination service. In IEEE international conference on web services (ICWS 2008) (pp. 600–607).

  33. Yamato, Y., Nishizawa, Y., Nagao, S., Sato, K. (2015a). Fast and reliable restoration method of virtual resources on OpenStack, IEEE Transactions on Cloud Computing,

  34. Yamato, Y., Katsuragi, S., Nagao, S., Miura, N. (2015b). Software maintenance evaluation of agile software development method based on OpenStack. IEICE Transactions on Information & Systems, E98-D(7), 1377–1380.

    Article  Google Scholar 

  35. Yamato, Y., Fukumoto, Y., Kumazaki, H. (2017). Predictive maintenance platform with sound stream analysis in edges. Journal of Information Processing, 25, 317–320.

    Article  Google Scholar 

  36. Yamato, Y., Demizu, T., Noguchi, H., Kataoka, M. (2018a). Automatic GPU offloading technology for open IoT environment. IEEE Internet of Things Journal.

  37. Yamato, Y., Noguchi, H., Kataoka, M., Isoda, T., Demizu, T. (2018b). Proposal of parallel processing area extraction and data transfer number reduction for automatic GPU offloading of IoT applications. In The 3rd international conference on smart computing and communication (SmartCom 2018) (pp. 39–54).

  38. Yokohata, Y., Yamato, Y., Takemoto, M., Sunaga, H. (2006a). Service composition architecture for programmability and flexibility in ubiquitous communication networks. In IEEE international symposium on applications and the internet workshops (SAINTW’06) (pp. 142–145).

  39. Yokohata, Y., Yamato, Y., Takemoto, M., Tanaka, E., Nishiki, K. (2006b). Context-aware content-provision service for shopping malls based on ubiquitous Service-Oriented network framework and authentication and access control agent framework. In IEEE consumer communications and networking conference (CCNC 2006) (pp. 1330–1331).

Download references

Author information



Corresponding author

Correspondence to Yoji Yamato.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Yamato, Y. Study of parallel processing area extraction and data transfer number reduction for automatic GPU offloading of IoT applications. J Intell Inf Syst 54, 567–584 (2020).

Download citation


  • Open IoT
  • Tacit computing
  • Data transfer optimization
  • Genetic algorithm
  • Automatic offloading