
Identifying affective levels on music video via completing the missing modality


Abstract

Emotion tagging, which labels stimuli with human-understandable semantic information, is a central theme in affective computing. Previous work indicates that modality fusion can improve the performance of this kind of task. However, acquiring subjects' responses is costly and time consuming, so the response modality required by fusion methods is absent for a large portion of multimedia content. To address this problem, this paper proposes a novel emotion tagging framework that completes the missing response modality based on the concept of brain encoding. In the framework, an encoding model is first built from the response modality (subjects' recorded responses) and the stimulus modality (features of the stimulus content). The model is then applied to videos whose response modality is absent in order to complete it. Finally, modality fusion is conducted on the stimulus and response modalities, followed by classification. To evaluate the proposed framework, the DEAP dataset is adopted as a benchmark. In the experiments, three kinds of features are employed as stimulus modalities, the response and fused modalities are computed under the proposed framework, and affective level identification is conducted as the emotion tagging task. The results demonstrate that the accuracy of the proposed framework exceeds that obtained using the stimulus modality alone, with improvements of more than 5% in both valence and arousal for all kinds of stimulus modalities. Moreover, the performance gain requires no extra physiological data acquisition, saving both cost and time.
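The pipeline described in the abstract can be summarized in a short sketch. The snippet below (Python with NumPy and scikit-learn) is only an illustrative outline of the idea, not the paper's exact implementation: the ridge-regression encoder, the linear SVM classifier, and feature-level concatenation for fusion are assumptions chosen for clarity, and all names (X_stim_train, X_resp_train, etc.) are hypothetical.

```python
# Minimal sketch of "completing the missing response modality" for emotion
# tagging, under assumed interfaces: stimulus features come from the music
# videos themselves, response features from subjects' recorded reactions
# (e.g. EEG-derived features in DEAP). Model choices here are illustrative.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.svm import SVC

def tag_affective_levels(X_stim_train, X_resp_train, y_train, X_stim_test):
    """Complete the missing response modality for unseen videos, then classify."""
    # 1. Encoding model: map stimulus features to response features on the
    #    videos for which subjects' responses were actually recorded.
    encoder = Ridge(alpha=1.0)
    encoder.fit(X_stim_train, X_resp_train)

    # 2. Complete the missing modality: predict pseudo-responses for videos
    #    that were never shown to subjects.
    X_resp_test_hat = encoder.predict(X_stim_test)

    # 3. Modality fusion: simple feature-level concatenation of the stimulus
    #    modality and the (real or completed) response modality.
    X_fused_train = np.hstack([X_stim_train, X_resp_train])
    X_fused_test = np.hstack([X_stim_test, X_resp_test_hat])

    # 4. Affective level identification (e.g. high/low valence or arousal).
    clf = SVC(kernel="linear")
    clf.fit(X_fused_train, y_train)
    return clf.predict(X_fused_test)
```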




Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 61401357 and by the Fundamental Research Funds for the Central Universities under Grant 3102016ZY023.

The authors would like to express their appreciation to the Multimedia Vision Group at Queen Mary, University of London, for its help during the first author's time working with the group.

Author information

Corresponding author

Correspondence to Gong Cheng.

About this article


Cite this article

Chen, M., Cheng, G. & Guo, L. Identifying affective levels on music video via completing the missing modality. Multimed Tools Appl 77, 3287–3302 (2018). https://doi.org/10.1007/s11042-017-5125-8

