Multimedia Tools and Applications, Volume 77, Issue 3, pp 3287–3302

Identifying affective levels on music video via completing the missing modality

  • Mo Chen
  • Gong Cheng
  • Lei Guo


Emotion tagging, which labels stimuli with human-understandable semantic information, is a topic of interest in affective computing. Previous work indicates that modality fusion can improve performance on this kind of task. However, acquiring subjects’ responses is costly and time-consuming, so the response modality required by modality fusion methods is absent for a large part of multimedia content. To address this problem, this paper proposes a novel emotion tagging framework that completes the missing response modality based on the concept of brain encoding. In the framework, an encoding model is built from the response modality, derived from subjects’ responses, and the stimulus modality, derived from stimulus content. The model is then applied to videos whose response modality is absent in order to complete it. Finally, modality fusion is performed on the stimulus and response modalities, followed by classification. To test the proposed framework, the DEAP dataset is adopted as a benchmark. In the experiments, three kinds of features are employed as stimulus modalities, and the response modality and fused modality are computed under the proposed framework. Affective level identification is conducted as the emotion tagging task. The results demonstrate that the proposed framework achieves higher accuracy than using the stimulus modality alone. The improvement in accuracy is higher than 5% for all kinds of stimulus modalities in both valence and arousal. Additionally, this improvement requires no extra physiological data acquisition, saving money and time.
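The pipeline described in the abstract (fit an encoding model, complete the missing response modality, then fuse) can be illustrated with a minimal sketch. It is not the authors' implementation: ridge regression as the encoding model, feature concatenation as the fusion step, and all function names, array shapes, and synthetic data below are assumptions made purely for illustration.

```python
import numpy as np


def _with_bias(features):
    """Append a constant column so the linear model can learn an offset."""
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_encoding_model(stimulus, response, ridge=1.0):
    """Fit a linear stimulus-to-response map (ridge regression, an assumption).

    stimulus: (n_videos, n_stimulus_features)
    response: (n_videos, n_response_features)
    """
    X = _with_bias(stimulus)
    penalty = ridge * np.eye(X.shape[1])
    penalty[-1, -1] = 0.0  # leave the bias term unpenalised
    return np.linalg.solve(X.T @ X + penalty, X.T @ response)


def complete_response(stimulus, weights):
    """Predict the response modality for videos that have no recorded responses."""
    return _with_bias(stimulus) @ weights


def fuse(stimulus, response):
    """Fuse the two modalities by concatenation (hypothetical fusion choice)."""
    return np.hstack([stimulus, response])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins: 40 videos with responses, 10 videos without.
    stim_seen = rng.normal(size=(40, 5))
    resp_seen = stim_seen @ rng.normal(size=(5, 3)) + 0.1 * rng.normal(size=(40, 3))
    stim_unseen = rng.normal(size=(10, 5))

    weights = fit_encoding_model(stim_seen, resp_seen)
    resp_completed = complete_response(stim_unseen, weights)
    fused = fuse(stim_unseen, resp_completed)
    print(fused.shape)
```

The fused features would then be passed to whichever classifier is used for affective level identification; that step is omitted here because the abstract does not specify it.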


Affective computing, Emotion tagging, Brain encoding, EEG, Modality fusion



This work was supported in part by the National Science Foundation of China under Grant 61401357 and the Fundamental Research Funds for the Central Universities under Grant 3102016ZY023.

The authors would like to express their appreciation to the Multimedia Vision Group at Queen Mary, University of London, for their help during the first author’s time working in the group.



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. School of Automation, Northwestern Polytechnical University, Xi’an, China
