Multimedia Tools and Applications, Volume 74, Issue 2, pp 613–634

Markov random field based fusion for supervised and semi-supervised multi-modal image classification

  • Liang Xie
  • Peng Pan
  • Yansheng Lu


In recent years there has been a massive explosion of multimedia content on the web, and multi-modal examples such as images associated with tags can be easily accessed from social websites such as Flickr. In this paper, we consider two classification tasks, supervised and semi-supervised multi-modal image classification, to take advantage of the growing number of multi-modal examples on the web. We first propose a Markov random field (MRF) based fusion method, discriminative probabilistic graphical fusion (DPGF), for supervised multi-modal image classification; it uses the associated tags to improve classification performance. Based on DPGF, we then propose a three-step learning procedure, DPGF+RLS+SVM, for semi-supervised multi-modal image classification, which uses both labeled and unlabeled examples for training. Experimental results on two datasets, PASCAL VOC’07 and MIR Flickr, show that our methods exploit the multi-modal data and unlabeled examples well, and that they outperform previous state-of-the-art methods on both multi-modal image classification tasks. Finally, we consider the weakly supervised condition, in which class labels are taken from noisy image tags. Our semi-supervised approach also improves classification performance in this case.
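The abstract names the three steps of the semi-supervised procedure but does not say how they connect. The sketch below shows one plausible reading, not the authors' implementation. It assumes that tags are available for all training images but not at test time, that RLS means regularized least squares, and that the SVM is linear. DPGF itself is not reproduced: step 1 uses a simple stand-in that averages per-modality scores. All function names and parameter values are hypothetical.

```python
import numpy as np

def add_bias(X):
    """Append a constant column so linear models can learn an offset."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def rls_fit(X, y, lam=1.0):
    """Regularized least squares in closed form: w = (X^T X + lam I)^-1 X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def fusion_scores(Xv_lab, Xt_lab, y_lab, Xv, Xt, lam=1.0):
    """Step 1 stand-in (NOT DPGF): average of per-modality RLS scores,
    with both models fitted on the labeled examples only."""
    wv = rls_fit(add_bias(Xv_lab), y_lab, lam)
    wt = rls_fit(add_bias(Xt_lab), y_lab, lam)
    return 0.5 * (add_bias(Xv) @ wv + add_bias(Xt) @ wt)

def svm_fit(X, y, lam=0.01, lr=0.1, epochs=300):
    """Linear SVM by subgradient descent on lam/2 * ||w||^2 + mean hinge loss.
    Labels y must be in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        viol = y * (X @ w) < 1  # examples with margin below 1
        grad = lam * w - (X[viol].T @ y[viol]) / len(y)
        w -= lr * grad
    return w

def visual_features_with_score(Xv, w_rls):
    """Visual features augmented with the RLS-predicted fusion score and a bias."""
    return add_bias(np.column_stack([Xv, add_bias(Xv) @ w_rls]))

def train(Xv_lab, Xt_lab, y_lab, Xv_unl, Xt_unl):
    """Assumed data flow: fusion scores -> RLS on all images -> SVM on labeled images."""
    Xv_all = np.vstack([Xv_lab, Xv_unl])
    Xt_all = np.vstack([Xt_lab, Xt_unl])
    # Step 1: multi-modal scores for every training image, labeled or not.
    s_all = fusion_scores(Xv_lab, Xt_lab, y_lab, Xv_all, Xt_all)
    # Step 2: regress those scores from visual features alone.
    w_rls = rls_fit(add_bias(Xv_all), s_all)
    # Step 3: SVM on the labeled images, using visual features plus the RLS score.
    w_svm = svm_fit(visual_features_with_score(Xv_lab, w_rls), y_lab)
    return w_rls, w_svm

def predict(Xv, w_rls, w_svm):
    """Classify images from visual features only; returns labels in {-1, +1}."""
    scores = visual_features_with_score(Xv, w_rls) @ w_svm
    return np.where(scores >= 0, 1, -1)
```

In this reading the unlabeled images contribute only through step 2, where they widen the set of examples on which the visual-only regressor learns to imitate the multi-modal scores.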


Multi-modal classification; Image classification; Semi-supervised learning; Markov random field



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
