
Multivariate Attention Network for Image Captioning

  • Conference paper
Computer Vision – ACCV 2018 (ACCV 2018)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11366)


Abstract

Recently, the attention mechanism has been used extensively in computer vision to gain a deeper understanding of images through selective local analysis. However, existing methods apply the attention mechanism in isolation, which leads to irrelevant or inaccurate words in the generated captions. To solve this problem, we propose a Multivariate Attention Network (MAN) for image captioning, which contains a content attention for identifying the content of objects, a position attention for locating important patches, and a minutia attention for preserving fine-grained information about target objects. Furthermore, we construct a Multivariate Residual Network (MRN) to integrate a more discriminative multimodal representation by modeling projections and extracting relevant relations among visual information from different modalities. Our MAN is inspired by recent findings in neuroscience and is designed to mimic how the human brain processes visual information. Compared with previous methods, we exploit diverse visual information and several multimodal integration strategies, which significantly improves the performance of our model. Experimental results show that our MAN model outperforms state-of-the-art approaches on two benchmark datasets, MS-COCO and Flickr30K.
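The abstract only outlines the architecture, so the sketch below is a minimal illustration of the idea rather than the authors' implementation: three independent soft-attention branches (content, position, minutia) attend over region features conditioned on the decoder state, and their outputs are projected into a common space and fused with a residual connection, loosely mirroring the described MRN. All module names, dimensions, and the additive-attention form are assumptions made for illustration.

```python
# Hypothetical sketch of a multivariate-attention step; the paper's exact
# attention forms, feature sources, and fusion equations are not given in this
# abstract, so every module below is an illustrative stand-in.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftAttention(nn.Module):
    """Additive soft attention over a set of region features."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # feats: (B, R, feat_dim) region features; state: (B, hidden_dim) decoder state
        energy = torch.tanh(self.feat_proj(feats) + self.state_proj(state).unsqueeze(1))
        alpha = F.softmax(self.score(energy).squeeze(-1), dim=1)   # attention weights (B, R)
        return (alpha.unsqueeze(-1) * feats).sum(dim=1)            # attended vector (B, feat_dim)


class MultivariateAttention(nn.Module):
    """Content / position / minutia attention branches with residual multimodal fusion."""

    def __init__(self, content_dim, position_dim, minutia_dim, hidden_dim, fused_dim=1024):
        super().__init__()
        self.content_attn = SoftAttention(content_dim, hidden_dim)
        self.position_attn = SoftAttention(position_dim, hidden_dim)
        self.minutia_attn = SoftAttention(minutia_dim, hidden_dim)
        # Project each attended vector into a shared space before fusion.
        self.proj_c = nn.Linear(content_dim, fused_dim)
        self.proj_p = nn.Linear(position_dim, fused_dim)
        self.proj_m = nn.Linear(minutia_dim, fused_dim)
        self.fuse = nn.Linear(fused_dim, fused_dim)

    def forward(self, content_feats, position_feats, minutia_feats, state):
        c = self.proj_c(self.content_attn(content_feats, state))
        p = self.proj_p(self.position_attn(position_feats, state))
        m = self.proj_m(self.minutia_attn(minutia_feats, state))
        joint = c + p + m                                # combine the three modalities
        return joint + torch.relu(self.fuse(joint))      # residual connection over the fusion


if __name__ == "__main__":
    B, R = 2, 36  # batch size, number of image regions (arbitrary example values)
    man = MultivariateAttention(content_dim=2048, position_dim=512,
                                minutia_dim=2048, hidden_dim=1024)
    out = man(torch.randn(B, R, 2048), torch.randn(B, R, 512),
              torch.randn(B, R, 2048), torch.randn(B, 1024))
    print(out.shape)  # torch.Size([2, 1024]) -- fused context vector fed to the decoder
```

In such a design, the fused vector would be passed to a caption decoder (e.g., an LSTM) at each time step; the residual connection lets the decoder fall back on the summed projections when the fusion layer adds little.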



Acknowledgement

This work was supported in part by the NSFC (61673402), the NSF of Guangdong Province (2017A030311029), the Science and Technology Program of Guangzhou (201704020180), and the Fundamental Research Funds for the Central Universities of China.

Author information

Correspondence to Haifeng Hu.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, W., Chen, Z., Hu, H. (2019). Multivariate Attention Network for Image Captioning. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds) Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science, vol. 11366. Springer, Cham. https://doi.org/10.1007/978-3-030-20876-9_37


  • DOI: https://doi.org/10.1007/978-3-030-20876-9_37


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20875-2

  • Online ISBN: 978-3-030-20876-9

  • eBook Packages: Computer Science, Computer Science (R0)
