Semi-supervised cross-modal learning for cross modal retrieval and image annotation

  • Fuhao Zou
  • Xingqiang Bai
  • Chaoyang Luan
  • Kai Li
  • Yunfei Wang
  • Hefei Ling
Part of the following topical collections:
  1. Special Issue on Deep vs. Shallow: Learning for Emerging Web-scale Data Computing and Applications


Multimedia data are usually associated with multiple modalities represented by heterogeneous features. Many information retrieval tasks are no longer restricted to a single modality, and content-based cross modal retrieval has become a popular research field. The premise of cross modal retrieval is to discover the relationships between different modalities efficiently. Although some approaches have been proposed to address this challenging problem, they either ignore the valuable labels or depend heavily on completely labeled training data. In addition, for features with relatively high dimensionality, it is important to select the most informative ones. In this paper, we propose a semi-supervised algorithm for cross modal learning. Our algorithm makes full use of both a small amount of labeled data and abundant unlabeled data to establish connections between modalities by discovering a shared semantic space. In addition, our algorithm automatically filters out noisy and redundant features to further improve the model. Finally, we give an efficient solution to the objective function. Experiments on two publicly available datasets demonstrate that the proposed method is competitive with, or even superior to, its state-of-the-art counterparts.


Cross modal, Low rank, Sparse learning



This work is supported in part by the National Natural Science Foundation of China under Grant Nos. 61672254 and 61300222, the Key Project of the National Natural Science Foundation of China under Grant No. U1536203, the Natural Science Foundation of Hubei Province under Grant No. 2015CFB687, and the Fundamental Research Funds for the Central Universities, HUST: 2016YXMS088. The authors appreciate the valuable suggestions from the anonymous reviewers and the editors.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Fuhao Zou (1)
  • Xingqiang Bai (1)
  • Chaoyang Luan (1)
  • Kai Li (1)
  • Yunfei Wang (1)
  • Hefei Ling (1)
  1. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
