A deep multimodal generative and fusion framework for class-imbalanced multimodal data


The purpose of multimodal classification is to integrate features from diverse information sources to make decisions. The interactions between different modalities are crucial to this task. However, common strategies in previous studies have been to either concatenate features from various sources into a single compound vector or input them separately into several different classifiers that are then assembled into a single robust classifier to generate the final prediction. Both of these approaches weaken or even ignore the interactions among different feature modalities. In addition, in the case of class-imbalanced data, multimodal classification becomes troublesome. In this study, we propose a deep multimodal generative and fusion framework for multimodal classification with class-imbalanced data. This framework consists of two modules: a deep multimodal generative adversarial network (DMGAN) and a deep multimodal hybrid fusion network (DMHFN). The DMGAN is used to handle the class imbalance problem. The DMHFN identifies fine-grained interactions and integrates different information sources for multimodal classification. Experiments on a faculty homepage dataset show the superiority of our framework compared to several start-of-the-art methods.

This is a preview of subscription content, log in to check access.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16


  1. 1.


  2. 2.

    This model comprises 3 million 300-dimensional English word vectors and is accessible at https://code.google.com/archive/p/word2vec/


  1. 1.

    Ai C, Norton EC (2003) Interaction terms in logit and probit models. Econ Lett 80(1):123–129

    Article  Google Scholar 

  2. 2.

    Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: International conference on learning representations (ICLR)

  3. 3.

    Baltrušaitis T, Ahuja C, Morency L-P (2018) Multimodal machine learning: a survey and taxonomy. IEEE Trans Pattern Anal Mach Intell (TPAMI) 41(2):423–443

    Article  Google Scholar 

  4. 4.

    Basu A (1976) Elementary statistical theory in sociology. Brill Archive, 12

  5. 5.

    Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res (JAIR) 16:321–357

    Article  Google Scholar 

  6. 6.

    Chen T, Guestrin C (2016) XGBOOST: a scalable tree boosting system. In: Proceedings of the 22nd ACM international conference on knowledge discovery and data mining (SIGKDD), pp 785–794

  7. 7.

    Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 764–773

  8. 8.

    Dal Pozzolo A, Caelen O, Johnson RA, Bontempi G (2015) Calibrating probability with undersampling for unbalanced classification. In: IEEE Computational intelligence, 2015 IEEE symposium series, pp 159–166

  9. 9.

    Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: Computer Vision and Pattern Recognition (CVPR), IEEE, vol 1, pp 886–893

  10. 10.

    Douzas G, Bacao F (2018) Effective data generation for imbalanced learning using conditional generative adversarial networks. Expert Syst Appl (ESWA) 91:464–471

    Article  Google Scholar 

  11. 11.

    Douzas G, Bacao F (2018) Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf Sci 465:1–20

    Article  Google Scholar 

  12. 12.

    Dwibedi D, Aytar Y, Tompson J, Sermanet P, Zisserman A (2019) Temporal cycle-consistency learning. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 1801–1810

  13. 13.

    Farnadi G, Tang J, De Cock M, Moens M-F (2018) User profiling through deep multimodal fusion. In: Proceedings of the eleventh ACM international conference on web search and data mining, ACM, pp 171–179

  14. 14.

    Gao L, Guo Z, Zhang H, Xu X, Shen HT (2017) Video captioning with attention-based LSTM and semantic consistency. IEEE Trans Multimed (TMM) 19(9):2045–2055

    Article  Google Scholar 

  15. 15.

    Gao L, Li X, Song J, Shen HT (2019) Hierarchical LSTMs with adaptive attention for visual captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

  16. 16.

    Goodfellow I, Pouge Abadie J, Mirza M, Xu B, Warde Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in Neural Information Processing Systems (NIPS), pp 2672–2680

  17. 17.

    Guo W, Wang J, Wang S (2019) Deep multimodal representation learning: a survey. IEEE Access 7:63373–63394

    Article  Google Scholar 

  18. 18.

    He H, Bai Y, Garcia EA, Li S (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence), IEEE, pp 1322–1328

  19. 19.

    He H, Garcia EA (2008) Learning from imbalanced data. IEEE Trans Knowl Data Eng (TKDE), (9)1263–1284

  20. 20.

    He H, Shen X (2007) A ranked subspace learning method for gene expression data classification. In: International conference on artificial intelligence (ICAI), pp 358–364

  21. 21.

    Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning (ICML)

  22. 22.

    James AP, Dasarathy BV (2014) Medical image fusion: a survey of the state of the art. Inform Fusion 19:4–19

    Article  Google Scholar 

  23. 23.

    Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal (IDA) 6(5):429–449

    Article  Google Scholar 

  24. 24.

    Khaleghi B, Khamis A, Karray FO, Razavi SN (2013) Multisensor data fusion: a review of the state-of-the-art. Inform Fusion 14(1):28–44

    Article  Google Scholar 

  25. 25.

    Kingma DP, Ba JL (2015) Adam: a method for stochastic optimization. In: International conference on learning representations (ICLR)

  26. 26.

    Kubat M, Holte RC, Matwin S (1998) Machine learning for the detection of oil spills in satellite radar images. Mach Learn 30(2-3):195–215

    Article  Google Scholar 

  27. 27.

    Li Q, Chen Y, Jiang LL, Li P, Chen H (2016) A tensor-based information framework for predicting the stock market. ACM Trans Inf Syst (TOIS) 34(2):11

    Article  Google Scholar 

  28. 28.

    Li Q, Tan J, Wang J, Chen H (2020) A multimodal event-driven LSTM model for stock prediction using online news. IEEE Transactions on Knowledge and Data Engineering (TKDE). https://doi.org/10.1109/TKDE.2020.2968894

  29. 29.

    Li Q, Wang J, Wang F, Li P, Liu L, Chen Y (2017) The role of social sentiment in stock markets: a view from joint effects of multiple information sources. Multimed Tools Appl (MTAP) 76(10):12315–12345

    Article  Google Scholar 

  30. 30.

    Li Q, Wang T, Gong Q, Chen Y, Lin Z, Song S (2014) Media-aware quantitative trading based on public web information. Decision Support Systems (DSS) 61:93–105

    Article  Google Scholar 

  31. 31.

    Louzada F, Ferreira Silva PH, Diniz CarlosAR (2012) On the impact of disproportional samples in credit scoring models: an application to a brazilian bank data. Expert Syst Appl (ESWA) 39(9):8071–8078

    Article  Google Scholar 

  32. 32.

    Mansoorizadeh M, Charkari NM (2010) Multimodal information fusion application to human emotion recognition from face and speech. Multimed Tools Appl (MTAP) 49(2):277–297

    Article  Google Scholar 

  33. 33.

    Mathieu MF, Zhao JJ, Zhao J, Ramesh A, Sprechmann P, LeCun Y (2016) Disentangling factors of variation in deep representation using adversarial training. In: Advances in Neural Information Processing Systems (NIPS), pp 5040–5048

  34. 34.

    Metz CE (1978) Basic principles of roc analysis. In: Seminars in Nuclear Medicine, vol 8. Elsevier, Amsterdam, pp 283–298

  35. 35.

    Morvant E, Habrard A, Ayache S (2014) Majority vote of diverse classifiers for late fusion. In: Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR). Springer, Berlin, pp 153–162

  36. 36.

    Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY (2011) Multimodal deep learning. In: Proceedings of the 28th international conference on machine learning (ICML), pp 689–696

  37. 37.

    Oshri B, Khandwala N (2015) There and back again: autoencoders for textual reconstruction

  38. 38.

    Pearson R, Goney G, Shwaber J (2003) Imbalanced clustering for microarray time-series. In: Proceedings of the international conference on machine learning (ICML), vol. 3

  39. 39.

    Poria S, Cambria E, Bajpai R, Hussain A (2017) A review of affective computing: from unimodal analysis to multimodal fusion. Inform Fusion 37:98–125

    Article  Google Scholar 

  40. 40.

    Potamianos G, Neti C, Gravier G, Garg A, Senior AW (2003) Recent advances in the automatic recognition of audiovisual speech. Proc IEEE 91(9):1306–1326

    Article  Google Scholar 

  41. 41.

    Qi X, Davison BD (2009) Web page classification: features and algorithms. ACM Comput Surv (CSUR) 41(2):12

    Article  Google Scholar 

  42. 42.

    Rendle S (2010) Factorization machines. In: 2010 IEEE international conference on data mining (ICDM), pp 995–1000

  43. 43.

    Roth K, Lucchi A, Nowozin S, Hofmann T (2017) Stabilizing training of generative adversarial networks through regularization. In: Advances in neural information processing systems (NIPS), pp 2018–2028

  44. 44.

    Shin HC, Tenenholtz NA, Rogers JK, Schwarz CG, Senjem ML, Gunter JL, Andriole KP, Michalski M (2018) Medical image synthesis for data augmentation and anonymization using generative adversarial networks. In: International workshop on simulation and synthesis in medical imaging (SASHIMI). Springer, Berlin, pp 1–11

  45. 45.

    Song J, Guo Y, Gao L, Li X, Hanjalic A, Shen HT (2018) From deterministic to generative: multimodal stochastic rnns for video captioning. IEEE Trans Neural Netw Learn Syst (TNNLS) 30(10):3047–3058

    Article  Google Scholar 

  46. 46.

    Song J, Zhang J, Gao L, Liu X, Shen HT (2018) Dual conditional GANs for face aging and rejuvenation. In: International joint conference on artificial intelligence (IJCAI), pp 899–905

  47. 47.

    Sprent P, Smeeton NC (2000) Applied nonparametric statistical methods. Chapman and Hall/CRC

  48. 48.

    Srivastava N, Mansimov E, Salakhudinov R (2015) Unsupervised learning of video representations using LSTMs. In: International conference on machine learning (ICML), pp 843–852

  49. 49.

    Srivastava N, Salakhutdinov RR (2012) Multimodal learning with deep boltzmann machines. In: Advances in neural information processing systems (NIPS), pp 2222–2230

  50. 50.

    Sun J, Lang J, Fujita H, Li H (2018) Imbalanced enterprise credit evaluation with DTE-SBD: decision tree ensemble based on SMOTE and bagging with differentiated sampling rates. Inf Sci 425:76–91

    MathSciNet  Article  Google Scholar 

  51. 51.

    Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in neural information processing systems (NIPS), pp 3104–3112

  52. 52.

    Suzuki M, Nakayama K, Matsuo Y (2017) Joint multimodal learning with deep generative models. In: International conference on learning representations (ICLR) (Workshop)

  53. 53.

    Tsai C, Lin W, Hu Y, Yao G (2019) Under-sampling class imbalanced datasets by combining clustering analysis and instance selection. Inf Sci 477:47–54

    Article  Google Scholar 

  54. 54.

    Tsai Y-HH, Bai S, Liang PP, Kolter JZ, Morency L-P, Salakhutdinov R (2019) Multimodal transformer for unaligned multimodal language sequences. In: The 57th annual meeting of the association for computational linguistics (ACL 2019), pp 6558–6569

  55. 55.

    Vartak MN, et al. (1955) On an application of kronecker product of matrices to statistical designs. Ann Math Stat 26(3):420–438

    MathSciNet  Article  Google Scholar 

  56. 56.

    Wu M, Goodman N (2018) Multimodal generative models for scalable weakly-supervised learning. In: Advances in neural information processing systems (NIPS), pp 5575–5585

  57. 57.

    Xu T, Zhang P, Huang Q, Zhang H, Gan Z, Huang X, He X (2018) AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 1316–1324

  58. 58.

    Yih Wt, He X, Meek C (2014) Semantic parsing for single-relation question answering. In: Proceedings of the 52nd annual meeting of the association for computational linguistics (NAACL) (Short Papers), vol 2, pp 643–648

  59. 59.

    Yingzhen L, Mandt S (2018) Disentangled sequential autoencoder. In: International conference on machine learning (ICML), pp 5670–5679

  60. 60.

    Yu G, Li Q, Wang J, Zhang D, Liu Y (2020) A multimodal generative and fusion framework for recognizing faculty homepages. Inf Sci 525:205–220

    MathSciNet  Article  Google Scholar 

  61. 61.

    Zhang C, Yang Z, He X, Deng L (2020) Multimodal intelligence: representation learning, information fusion, and applications. IEEE Journal of Selected Topics in Signal Processing, 1–1

  62. 62.

    Zhu X, Hu H, Lin S, Dai J (2019) Deformable convnets v2: more deformable, better results. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 9308–9316

Download references


This work was supported by the National Natural Science Foundation of China (NSFC) (71671141 and 71873108), the National Social Science Foundation of China (NSSFC) (Grant No. 19BFX120), the Fundamental Research Funds for the Central Universities (JBK 171113, JBK 170505, JBK 1806003, and JBK 2002030), the Science and Technology Department of Sichuan Province (2019YJ0250), the Fintech Innovation Center of Southwestern University of Finance and Economics, and the Financial Intelligence and Financial Engineering Key Laboratory of Sichuan Province.

Author information



Corresponding author

Correspondence to Guanyuan Yu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Li, Q., Yu, G., Wang, J. et al. A deep multimodal generative and fusion framework for class-imbalanced multimodal data. Multimed Tools Appl (2020). https://doi.org/10.1007/s11042-020-09227-4

Download citation


  • Multimodal classification
  • Class-imbalanced data
  • Deep multimodal generative adversarial network
  • Deep multimodal hybrid fusion network