Abstract
This paper proposes a novel deep neural network model for handling multimodal data. The proposed model seamlessly fuses multimodal inputs and reduces the dimensionality of the input feature space. The architecture employs a modified stacked autoencoder in conjunction with a multilayer perceptron-based regression model. Two variants of the architecture are proposed, and experiments are performed on a multimodal benchmark dataset (RECOLA) to study the impact of multimodality compared with a single modality. Experiments are also conducted to illustrate the effect of presenting the multimodal data in a sequential or a concatenated manner. The results obtained are encouraging: the proposed approach is computationally less expensive than existing approaches, while its performance is better than or on par with other techniques.
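To make the fused architecture concrete, below is a minimal Keras sketch of the concatenated variant: audio and video features are merged, passed through an autoencoder-style bottleneck for dimensionality reduction, and regressed onto two continuous affective dimensions with a small MLP head. All feature dimensions, layer sizes, the choice of arousal/valence as targets, and the optimizer are illustrative assumptions rather than the authors' exact configuration, and autoencoder pretraining with a reconstruction loss is omitted for brevity.

```python
# Minimal sketch of multimodal fusion + dimensionality reduction + MLP
# regression, assuming two modalities (audio, video) and two affective
# targets (arousal, valence). All sizes below are hypothetical.
import numpy as np
from tensorflow.keras.layers import Input, Dense, Concatenate
from tensorflow.keras.models import Model

AUDIO_DIM, VIDEO_DIM = 88, 168  # assumed per-frame feature sizes

# Concatenated variant: fuse both modalities before encoding.
audio_in = Input(shape=(AUDIO_DIM,), name="audio")
video_in = Input(shape=(VIDEO_DIM,), name="video")
fused = Concatenate()([audio_in, video_in])

# Stacked, autoencoder-style encoder acting as a bottleneck that
# reduces the fused feature space (reconstruction pretraining omitted).
enc = Dense(128, activation="relu")(fused)
enc = Dense(64, activation="relu")(enc)
code = Dense(32, activation="relu", name="bottleneck")(enc)

# MLP regression head predicting two emotional dimensions.
hid = Dense(16, activation="relu")(code)
out = Dense(2, activation="linear", name="affect")(hid)

model = Model(inputs=[audio_in, video_in], outputs=out)
model.compile(optimizer="adadelta", loss="mse")

# Dummy data, just to show the expected input/output shapes.
Xa = np.random.rand(256, AUDIO_DIM).astype("float32")
Xv = np.random.rand(256, VIDEO_DIM).astype("float32")
y = np.random.rand(256, 2).astype("float32")
model.fit([Xa, Xv], y, epochs=2, batch_size=32, verbose=0)
```

The sequential variant studied in the paper would instead present one modality's features at a time; under the assumptions of this sketch, that would correspond to encoding each input branch in turn rather than encoding their concatenation.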
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Bhandari, D., Paul, S., Narayan, A. (2019). Multimodal Data Fusion and Prediction of Emotional Dimensions Using Deep Neural Network. In: Verma, N., Ghosh, A. (eds) Computational Intelligence: Theories, Applications and Future Directions - Volume II. Advances in Intelligent Systems and Computing, vol 799. Springer, Singapore. https://doi.org/10.1007/978-981-13-1135-2_17
DOI: https://doi.org/10.1007/978-981-13-1135-2_17
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1134-5
Online ISBN: 978-981-13-1135-2
eBook Packages: Intelligent Technologies and Robotics (R0)