
X-GACMN: An X-Shaped Generative Adversarial Cross-Modal Network with Hypersphere Embedding

  • Conference paper
  • Computer Vision – ACCV 2018 (ACCV 2018)

Abstract

How to bridge the heterogeneous gap between different modalities is one of the main challenges in cross-modal retrieval. Most existing methods tackle this problem by projecting data from different modalities into a common space. In this paper, we introduce a novel X-Shaped Generative Adversarial Cross-Modal Network (X-GACMN) to learn a better common space between different modalities. Specifically, the proposed architecture combines synthetic data generation and distribution adaptation in a unified framework, so that the heterogeneous modality distributions become similar to each other in the learned common subspace. To promote discriminative ability, a new loss function combining an intra-modality angular softmax loss with a cross-modality pair-wise consistency loss is further imposed on the common space, so the learned features preserve both inter-modality and intra-modality structure on a hypersphere manifold. Extensive experiments on three benchmark datasets show the effectiveness of the proposed approach.
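
As a concrete illustration of the objective described above, the sketch below combines an intra-modality angular-margin softmax with a cross-modality pair-wise consistency term on L2-normalized (hypersphere) features. This is a minimal PyTorch-style sketch, not the authors' implementation: the paper's angular softmax is in the spirit of SphereFace, whereas this example substitutes a simpler additive-margin variant, and all module names, hyperparameters, and the weighting between the terms are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): an additive angular-margin softmax
# standing in for the paper's angular softmax, plus a cosine-based pair-wise
# consistency term between matched image/text embeddings in the common space.
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularSoftmaxLoss(nn.Module):
    """Additive angular-margin softmax on L2-normalized features and weights."""
    def __init__(self, feat_dim, num_classes, margin=0.35, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.margin, self.scale = margin, scale

    def forward(self, feats, labels):
        # Cosine similarity between unit features and unit class weights.
        cos = F.linear(F.normalize(feats), F.normalize(self.weight))
        # Subtract the margin from the target-class logit only.
        onehot = F.one_hot(labels, cos.size(1)).float()
        logits = self.scale * (cos - self.margin * onehot)
        return F.cross_entropy(logits, labels)

def pairwise_consistency_loss(img_feats, txt_feats):
    """Pull matched image/text embeddings together on the unit hypersphere."""
    img, txt = F.normalize(img_feats), F.normalize(txt_feats)
    return (1.0 - (img * txt).sum(dim=1)).mean()  # 1 - cosine similarity

def total_loss(img_feats, txt_feats, labels, ang_loss, lam=1.0):
    # Intra-modality discriminative terms plus the cross-modality pair term.
    return (ang_loss(img_feats, labels) + ang_loss(txt_feats, labels)
            + lam * pairwise_consistency_loss(img_feats, txt_feats))

# Usage with hypothetical shapes: features from the two branches projected
# into a 128-d common space over 10 semantic categories.
loss_fn = AngularSoftmaxLoss(feat_dim=128, num_classes=10)
img, txt = torch.randn(32, 128), torch.randn(32, 128)
labels = torch.randint(0, 10, (32,))
loss = total_loss(img, txt, labels, loss_fn)
```

Normalizing both the features and the class weights confines the representation to the unit hypersphere, so the angular margin and the cosine-based pair term operate on the same geometry; how the paper actually balances these terms and the adversarial objectives is specified in the full text, not here.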



Acknowledgement

We would like to thank the anonymous reviewers for their helpful comments on the paper. This research was supported by the National Natural Science Foundation of China (NSFC) under Grant 61772111.

Author information


Corresponding author

Correspondence to Xiangwei Kong.



Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Guo, W., Liang, J., Kong, X., Song, L., He, R. (2019). X-GACMN: An X-Shaped Generative Adversarial Cross-Modal Network with Hypersphere Embedding. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds) Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science(), vol 11365. Springer, Cham. https://doi.org/10.1007/978-3-030-20873-8_33


  • DOI: https://doi.org/10.1007/978-3-030-20873-8_33


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20872-1

  • Online ISBN: 978-3-030-20873-8

  • eBook Packages: Computer Science; Computer Science (R0)
