Abstract
Multimodal matching aims to establish relationships across different modalities, such as image and text. Existing works mainly focus on maximizing the correlation between feature vectors extracted from off-the-shelf models, making feature extraction and matching a two-stage learning process. This paper presents a novel two-stream convolutional neural network that integrates feature extraction and matching in an end-to-end manner. A visual stream and a textual stream are designed for feature extraction; their outputs are then concatenated and fed through multiple shared layers for multimodal matching. The network is trained with an extreme multiclass classification loss that treats each multimodal data pair as a distinct class, followed by a fine-tuning step under a ranking constraint. Experimental results on the Flickr30k dataset demonstrate the effectiveness of the proposed network for multimodal matching.
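To make the pipeline concrete, below is a minimal PyTorch sketch of the architecture the abstract describes: a visual stream and a textual stream whose features are concatenated and passed through shared layers, with a per-pair classification head for the first training stage and a scalar matching score for the ranking-based fine-tuning. All layer sizes, the tiny stand-in backbones, the 0.2 margin, and names such as VisualStream, TextualStream, and pair_ids are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualStream(nn.Module):
    # Image feature extractor; a small CNN stands in for the actual backbone.
    def __init__(self, out_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(128, out_dim)

    def forward(self, img):                      # img: (B, 3, H, W)
        return self.fc(self.conv(img).flatten(1))

class TextualStream(nn.Module):
    # Sentence feature extractor: 1-D convolutions over word embeddings.
    def __init__(self, vocab_size=10000, emb_dim=300, out_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Sequential(
            nn.Conv1d(emb_dim, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1))
        self.fc = nn.Linear(256, out_dim)

    def forward(self, tokens):                   # tokens: (B, T) word indices
        x = self.embed(tokens).transpose(1, 2)   # -> (B, emb_dim, T)
        return self.fc(self.conv(x).flatten(1))

class TwoStreamMatcher(nn.Module):
    # Concatenates both streams and scores the pair through shared layers.
    def __init__(self, feat_dim=512, num_classes=1000):
        super().__init__()
        self.visual = VisualStream(feat_dim)
        self.textual = TextualStream(out_dim=feat_dim)
        self.shared = nn.Sequential(
            nn.Linear(2 * feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU())
        self.classifier = nn.Linear(256, num_classes)  # one class per training pair
        self.score = nn.Linear(256, 1)                 # matching score for ranking

    def forward(self, img, tokens):
        joint = self.shared(torch.cat(
            [self.visual(img), self.textual(tokens)], dim=1))
        return self.classifier(joint), self.score(joint).squeeze(1)

# Stage 1: extreme multiclass classification, each image-text pair is a class.
#   logits, _ = model(images, captions)
#   cls_loss = F.cross_entropy(logits, pair_ids)
#
# Stage 2: fine-tune with a margin ranking constraint so a matched pair scores
# higher than a mismatched one (the 0.2 margin is an assumed value).
#   _, pos = model(images, captions)
#   _, neg = model(images, shuffled_captions)
#   rank_loss = F.relu(0.2 - pos + neg).mean()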
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grants No. 61771145 and No. 61371148.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, Y., Gu, Y., Gu, X. (2018). Two-Stream Convolutional Neural Network for Multimodal Matching. In: Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., Maglogiannis, I. (eds) Artificial Neural Networks and Machine Learning – ICANN 2018. Lecture Notes in Computer Science, vol. 11139. Springer, Cham. https://doi.org/10.1007/978-3-030-01418-6_2
DOI: https://doi.org/10.1007/978-3-030-01418-6_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01417-9
Online ISBN: 978-3-030-01418-6
eBook Packages: Computer Science (R0)