Abstract
Recent image captioning models usually feed the decoder the output of the last convolutional layer of a pretrained CNN encoder. This intuitive design has two weaknesses: the top-layer features are not position-sensitive, which makes it hard for the decoder to generate precise spatial attention over the object of interest; and irrelevant features mislead the decoder into attending to irrelevant regions. To tackle these weaknesses, we propose the Feature Selection and Fusion Network (FSFN). Specifically, to address the first weakness, a Feature Fusion module generates fine-grained, position-sensitive features by fusing multi-scale features. To address the second, a Feature Selection module selects the more informative features, preventing the decoder from focusing on irrelevant regions. Extensive experiments demonstrate that our model successfully addresses both weaknesses and achieves results comparable to the state of the art on the MSCOCO dataset under cross-entropy loss, without any bells and whistles. Furthermore, our model improves performance across different encoders and decoders.
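The abstract does not give implementation details, so the following is only a minimal PyTorch sketch of the two ideas it names: an FPN-style lateral fusion of a coarse and a finer feature map, and an SE-style sigmoid channel gate for feature selection. The specific layer choices, channel sizes, and upsampling mode are assumptions for illustration, not the paper's actual modules.

# Illustrative sketch only: the paper's exact FSFN architecture is not
# specified here, so all layer choices below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Fuse a coarse top-layer map with a finer lower-level map to get
    position-sensitive multi-scale features (assumed FPN-style)."""
    def __init__(self, high_ch: int, low_ch: int, out_ch: int):
        super().__init__()
        self.reduce_high = nn.Conv2d(high_ch, out_ch, kernel_size=1)
        self.reduce_low = nn.Conv2d(low_ch, out_ch, kernel_size=1)
        self.smooth = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, high: torch.Tensor, low: torch.Tensor) -> torch.Tensor:
        # Upsample the coarse map to the fine map's spatial size, then add.
        high = F.interpolate(self.reduce_high(high), size=low.shape[-2:],
                             mode="bilinear", align_corners=False)
        return self.smooth(high + self.reduce_low(low))

class FeatureSelection(nn.Module):
    """Suppress uninformative channels with a learned sigmoid gate
    (assumed SE-style channel attention)."""
    def __init__(self, ch: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Re-weight channels before the decoder attends over the map.
        return x * self.gate(x)

if __name__ == "__main__":
    # Toy shapes mimicking ResNet stages: C5 (coarse) and C4 (finer).
    c5 = torch.randn(1, 2048, 7, 7)
    c4 = torch.randn(1, 1024, 14, 14)
    fused = FeatureFusion(2048, 1024, 512)(c5, c4)
    selected = FeatureSelection(512)(fused)
    print(selected.shape)  # torch.Size([1, 512, 14, 14])

The decoder would then compute its spatial attention over `selected` rather than over the raw C5 map, which is the intuition behind both weaknesses the abstract describes.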
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Hu, S., Huang, S., Wang, G., Li, Z., Qin, Z. (2019). Delving into Precise Attention in Image Captioning. In: Gedeon, T., Wong, K., Lee, M. (eds) Neural Information Processing. ICONIP 2019. Communications in Computer and Information Science, vol 1143. Springer, Cham. https://doi.org/10.1007/978-3-030-36802-9_9
DOI: https://doi.org/10.1007/978-3-030-36802-9_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-36801-2
Online ISBN: 978-3-030-36802-9