Multi-modal learning for affective content analysis in movies



Affective content analysis is an important research topic in video content analysis and has extensive applications in many fields. However, designing a computational model that predicts the emotions induced by videos is challenging, because the elicited emotions are relatively subjective. Intuitively, features from several modalities can characterize the elicited emotions, but the correlations among these features and their individual contributions are still not well studied. To address this issue, we propose a multi-modal learning framework that classifies affective content in the valence-arousal space. In particular, we use features extracted by the motion keypoint trajectory method and by convolutional neural networks to describe the visual modality of elicited emotions, and extract a global audio feature with the openSMILE toolkit to describe the audio modality. Linear support vector machines and support vector regression are then employed to learn the affective models. By comparing these three features with five baseline features, we show that the three features are effective for describing affective content, and the experimental results further demonstrate that they complement each other. Moreover, the proposed framework achieves state-of-the-art results on two challenging datasets for video affective content analysis.


Keywords: Affective content analysis · Convolutional neural networks · Motion keypoint trajectory · Multi-modal learning · Trajectory-based covariance
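The sketch below illustrates the kind of pipeline the abstract describes: pre-extracted visual (motion keypoint trajectory, CNN) and audio (openSMILE) features are fused and fed to a linear SVM for discrete affective classes and a linear SVR for continuous valence/arousal scores. It is a minimal illustration using scikit-learn with synthetic placeholder features, not the authors' exact implementation; the feature dimensions, fusion strategy (simple concatenation), and hyperparameters are assumptions for demonstration only.

```python
# Minimal sketch of a multi-modal affective model: concatenate pre-extracted
# MKT, CNN, and openSMILE features, then train a linear SVM (classification)
# and a linear SVR (regression). All data and dimensions are synthetic.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, LinearSVR

rng = np.random.default_rng(0)
n_clips = 200

# Placeholder per-clip descriptors; in practice these would come from the
# motion keypoint trajectory descriptor, CNN frame features, and openSMILE.
mkt_feat = rng.normal(size=(n_clips, 128))
cnn_feat = rng.normal(size=(n_clips, 256))
audio_feat = rng.normal(size=(n_clips, 64))

# Early fusion by feature concatenation (one of several possible strategies).
X = np.hstack([mkt_feat, cnn_feat, audio_feat])
y_class = rng.integers(0, 3, size=n_clips)      # e.g. valence class labels
y_score = rng.uniform(-1.0, 1.0, size=n_clips)  # e.g. continuous valence score

# Linear SVM for discrete affective classification.
clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
clf.fit(X[:150], y_class[:150])
print("classification accuracy:", clf.score(X[150:], y_class[150:]))

# Linear SVR for continuous valence/arousal regression.
reg = make_pipeline(StandardScaler(), LinearSVR(C=1.0, max_iter=5000))
reg.fit(X[:150], y_score[:150])
print("regression R^2:", reg.score(X[150:], y_score[150:]))
```

Late fusion (training one model per modality and combining their outputs) is an equally common alternative to the concatenation shown here; which works better depends on the dataset.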



This work was supported in part by the National Natural Science Foundation of China under Grants 61622115 and 61472281, the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning (No. GZ2015005), Shanghai Engineering Research Center of Industrial Vision Perception & Intelligent Computing (17DZ2251600), and the Key Research and Development Project of Jiangxi Provincial Department of Science and Technology (20171BBE50065).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science and Technology, Tongji University, Shanghai, People's Republic of China
  2. Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai, People's Republic of China
  3. Department of Mathematics and Computer Science, Gannan Normal University, Ganzhou, People's Republic of China
  4. Shanghai Engineering Research Center of Industrial Vision Perception & Intelligent Computing, Shanghai, People's Republic of China