CentralNet: A Multilayer Approach for Multimodal Fusion

  • Valentin Vielzeuf
  • Alexis Lechervy
  • Stéphane Pateux
  • Frédéric Jurie
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11134)


This paper proposes a novel multimodal fusion approach that aims to produce the best possible decisions by integrating information from multiple media. Most past multimodal approaches either project the features of the different modalities into a common space or coordinate the representations of each modality through constraints; our approach borrows from both views. More specifically, assuming that each modality can be processed by a separate deep convolutional network, so that decisions can be taken independently for each modality, we introduce a central network that links the modality-specific networks. This central network not only provides a common feature embedding but also regularizes the modality-specific networks through multi-task learning. The proposed approach is validated on four different computer vision tasks, on which it consistently improves on the accuracy of existing multimodal fusion approaches.
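The architecture described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not state the fusion rule or the exact loss, so everything concrete here is an assumption made for illustration: each central layer takes a weighted sum (weights `alphas`) of the previous central hidden state and the modality hidden states at the same depth, dense layers stand in for convolutional blocks, and the multi-task loss is the central loss plus weighted (`betas`) per-modality losses. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, CLASSES = 16, 3  # hypothetical sizes, not taken from the paper


def dense(x, w):
    """Fully connected layer with ReLU (a stand-in for a convolutional block)."""
    return np.maximum(x @ w, 0.0)


def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(probs, labels):
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())


def init_weights(dims):
    return [rng.normal(0.0, 0.1, size=(a, b)) for a, b in zip(dims[:-1], dims[1:])]


def forward(xs, modality_nets, central_net, alphas, heads):
    """Return class probabilities for every modality, then for the central network."""
    hs = list(xs)
    h_c = np.zeros((xs[0].shape[0], HIDDEN))  # central state starts empty
    for i, w_c in enumerate(central_net):
        # Each modality-specific network advances one layer on its own.
        hs = [dense(h, net[i]) for h, net in zip(hs, modality_nets)]
        # Assumed fusion rule: weighted sum of central and modality hidden states.
        fused = alphas[i][0] * h_c + sum(a * h for a, h in zip(alphas[i][1:], hs))
        h_c = dense(fused, w_c)
    return [softmax(h @ w) for h, w in zip(hs + [h_c], heads)]


def multitask_loss(probs, labels, betas):
    """Central loss plus weighted modality losses (assumed multi-task form)."""
    *modality_probs, central_probs = probs
    loss = cross_entropy(central_probs, labels)
    loss += sum(b * cross_entropy(p, labels) for b, p in zip(betas, modality_probs))
    return loss


# Toy batch: two modalities with different input sizes, four samples.
x_a, x_b = rng.normal(size=(4, 10)), rng.normal(size=(4, 6))
labels = np.array([0, 1, 2, 1])
modality_nets = [init_weights([10, HIDDEN, HIDDEN]), init_weights([6, HIDDEN, HIDDEN])]
central_net = init_weights([HIDDEN, HIDDEN, HIDDEN])
alphas = np.ones((2, 3))  # one row per layer: [central, modality A, modality B]
heads = [rng.normal(0.0, 0.1, size=(HIDDEN, CLASSES)) for _ in range(3)]

probs = forward([x_a, x_b], modality_nets, central_net, alphas, heads)
loss = multitask_loss(probs, labels, betas=[1.0, 1.0])
print(len(probs), probs[-1].shape, loss)
```

In this sketch the modality heads still produce their own predictions, which is what would let the per-modality losses act as a regularizer alongside the central prediction; in a trained model the fusion weights would presumably be learned rather than fixed to one.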


Multimodal fusion, Neural networks, Representation learning, Multi-task learning



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Valentin Vielzeuf (1, 2), corresponding author
  • Alexis Lechervy (2)
  • Stéphane Pateux (1)
  • Frédéric Jurie (2)
  1. Orange Labs, Rennes, France
  2. Normandie Univ., UNICAEN, ENSICAEN, CNRS, Caen, France
