Accelerating Decoding Step in Image Captioning on Smartphones

  • Conference paper
  • In: High-Performance Computing and Big Data Analysis (TopHPC 2019)

Abstract

In recent years, considerable effort has been devoted to increasing the accuracy of neural image captioning, one of the diverse applications of deep neural networks. Text-based image retrieval is one important application of image captioning; improving the quality of life for visually impaired people is another. Accordingly, fast and well-optimized implementations that run effectively on mobile processors are necessary. Despite the numerous image captioning approaches presented so far, few solutions take mobile computational capabilities into account. In this paper, we focus on a practical implementation of the decoding step of image captioning in Android applications. Iteration over variable-length sequences can be expressed with dynamic control flow, which avoids unrolling the computation to a fixed maximum length. Exploiting this facility speeds up the decoding routine of image captioning on smartphone devices. Experimental results on execution time validate the proposed approach.
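
The mechanism behind the claimed speedup can be illustrated concretely. TensorFlow exposes dynamic control flow through tf.while_loop, which lets a graph iterate until a runtime condition fails rather than executing a statically unrolled sequence of steps. The sketch below shows a greedy LSTM caption decoder built this way; it is a minimal illustration under assumed settings (the vocabulary size, dimensions, token ids, and seeding of the LSTM state with the image features are all illustrative choices, not the authors' published implementation).

```python
# Minimal sketch: greedy caption decoding with TensorFlow 1.x dynamic
# control flow (tf.while_loop). All sizes, token ids, and the state
# initialization are illustrative assumptions.
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 10000, 256, 512
MAX_LEN, START_ID, END_ID = 20, 1, 2

embeddings = tf.get_variable("embeddings", [VOCAB_SIZE, EMBED_DIM])
cell = tf.nn.rnn_cell.LSTMCell(HIDDEN_DIM)
W_out = tf.get_variable("W_out", [HIDDEN_DIM, VOCAB_SIZE])
b_out = tf.get_variable("b_out", [VOCAB_SIZE])

# Encoder output for a single image; here it simply seeds the LSTM state.
image_features = tf.placeholder(tf.float32, [1, HIDDEN_DIM])
init_state = tf.nn.rnn_cell.LSTMStateTuple(c=tf.zeros_like(image_features),
                                           h=image_features)

def cond(t, token, state, done, tokens):
    # Stop as soon as <end> is produced or the length cap is reached.
    # An unrolled graph would execute all MAX_LEN steps regardless.
    return tf.logical_and(t < MAX_LEN, tf.logical_not(done))

def body(t, token, state, done, tokens):
    inputs = tf.nn.embedding_lookup(embeddings, token)   # [1, EMBED_DIM]
    output, state = cell(inputs, state)                  # one LSTM step
    logits = tf.matmul(output, W_out) + b_out            # [1, VOCAB_SIZE]
    next_token = tf.argmax(logits, axis=-1, output_type=tf.int32)
    done = tf.reduce_any(tf.equal(next_token, END_ID))
    return t + 1, next_token, state, done, tokens.write(t, next_token)

loop_vars = [tf.constant(0),                                # step counter
             tf.constant([START_ID], dtype=tf.int32),       # <start> token
             init_state,
             tf.constant(False),                            # done flag
             tf.TensorArray(tf.int32, size=0, dynamic_size=True)]
_, _, _, _, tokens = tf.while_loop(cond, body, loop_vars)
caption_ids = tokens.stack()  # caption length is decided at run time
```

Because the loop terminates as soon as the end-of-sentence token appears, a short caption costs proportionally fewer LSTM steps on the phone's processor, whereas an unrolled decoder always pays for the maximum length; this is the effect the abstract attributes to dynamic control flow.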


Notes

  1. Samsung Galaxy J5.

  2. Huawei Honor 6x.

  3. Samsung Galaxy J5.


Author information

Correspondence to Behnam Samadi.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Samadi, B., Mansouri, A., Mahmoudi-Aznaveh, A. (2019). Accelerating Decoding Step in Image Captioning on Smartphones. In: Grandinetti, L., Mirtaheri, S., Shahbazian, R. (eds) High-Performance Computing and Big Data Analysis. TopHPC 2019. Communications in Computer and Information Science, vol 891. Springer, Cham. https://doi.org/10.1007/978-3-030-33495-6_33

  • DOI: https://doi.org/10.1007/978-3-030-33495-6_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-33494-9

  • Online ISBN: 978-3-030-33495-6
