Abstract
Bridging the heterogeneous gap between different modalities is one of the main challenges in cross-modal retrieval. Most existing methods tackle this problem by projecting data from different modalities into a common space. In this paper, we introduce a novel X-Shaped Generative Adversarial Cross-Modal Network (X-GACMN) to learn a better common space between different modalities. Specifically, the proposed architecture combines synthetic data generation and distribution adaptation into a unified framework, so that the heterogeneous modality distributions become similar to each other in the learned common subspace. To promote discriminative ability, a new loss function that combines an intra-modality angular softmax loss with a cross-modality pair-wise consistency loss is further imposed on the common space; the learned features thus preserve both inter-modality and intra-modality structure on a hypersphere manifold. Extensive experiments on three benchmark datasets demonstrate the effectiveness of the proposed approach.
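The two loss terms named in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the angular softmax term follows the SphereFace-style formulation (true-class logit replaced by ‖x‖·cos(mθ), here without the piecewise monotonic extension used in the original SphereFace loss), and the pair-wise consistency term simply pulls matched image/text features together on the unit hypersphere. All function and variable names are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Project row vectors onto the unit hypersphere.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def angular_softmax_loss(features, weights, labels, m=2):
    # Simplified A-Softmax: the logit for the ground-truth class is
    # ||x|| * cos(m * theta); all other class logits are ||x|| * cos(theta),
    # where theta is the angle between the feature and the class weight.
    x_norm = np.linalg.norm(features, axis=1)                      # (N,)
    w = l2_normalize(weights)                                      # (C, D)
    cos_theta = np.clip(l2_normalize(features) @ w.T, -1.0, 1.0)   # (N, C)
    theta = np.arccos(cos_theta)
    logits = x_norm[:, None] * cos_theta
    n = np.arange(len(labels))
    logits[n, labels] = x_norm * np.cos(m * theta[n, labels])      # margin
    logits -= logits.max(axis=1, keepdims=True)                    # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[n, labels].mean()                            # cross-entropy

def pairwise_consistent_loss(img_feat, txt_feat):
    # Matched image/text pairs should map to nearby points on the hypersphere.
    diff = l2_normalize(img_feat) - l2_normalize(txt_feat)
    return (diff ** 2).sum(axis=1).mean()
```

A larger margin `m` forces the true-class angle to shrink, which tightens intra-modality clusters, while the pair-wise term aligns the two modalities; a full model would add these to the adversarial objective with weighting hyperparameters.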
Acknowledgement
We would like to thank the anonymous reviewers for their helpful comments on the paper. This research was supported by the National Natural Science Foundation of China (NSFC) under Grant 61772111.
© 2019 Springer Nature Switzerland AG
Cite this paper
Guo, W., Liang, J., Kong, X., Song, L., He, R. (2019). X-GACMN: An X-Shaped Generative Adversarial Cross-Modal Network with Hypersphere Embedding. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds) Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science(), vol 11365. Springer, Cham. https://doi.org/10.1007/978-3-030-20873-8_33
DOI: https://doi.org/10.1007/978-3-030-20873-8_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20872-1
Online ISBN: 978-3-030-20873-8