Hierarchical-Gate Multimodal Network for Human Communication Comprehension

Conference paper in Natural Language Processing and Chinese Computing (NLPCC 2019), part of the book series Lecture Notes in Computer Science (LNAI, volume 11839).

Abstract

Computational modeling of human multimodal language is an emerging research area in natural language processing that spans the language, visual, and acoustic modalities. Comprehending multimodal language requires modeling not only the interactions within each modality (intra-modal interactions) but, more importantly, the interactions between modalities (cross-modal interactions). In this paper, we present a novel neural architecture for understanding human communication called the Hierarchical-Gate Multimodal Network (HGMN). Specifically, each modality is first encoded by a Bi-LSTM, which captures the intra-modal interactions within that single modality. We then merge the independent per-modality information using two gated layers: the first gate, called the modality gate, computes the weight of each modality, while the second, called the temporal gate, controls the contribution of each time step to the final prediction. Finally, a max-pooling strategy reduces the dimension of the multimodal representation, which is then fed to the prediction layer. We perform extensive comparisons on five publicly available datasets for multimodal sentiment analysis, emotion recognition, and speaker trait recognition. HGMN achieves state-of-the-art performance on all datasets.
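To make the pipeline concrete, below is a minimal PyTorch sketch of the architecture as the abstract describes it: per-modality Bi-LSTM encoding, a modality gate, a temporal gate, and max-pooling before the prediction layer. The class name HGMNSketch, the specific gate formulations (softmax weights over modalities, sigmoid weights over time steps), and the feature sizes in the usage lines are illustrative assumptions, not the paper's exact equations, which are not reproduced on this page.

    import torch
    import torch.nn as nn

    class HGMNSketch(nn.Module):
        # Hypothetical sketch of the pipeline in the abstract: per-modality
        # Bi-LSTM encoders, a modality gate weighting modalities, a temporal
        # gate weighting time steps, then max-pooling over time. The exact
        # gate equations are assumptions, not the authors' formulation.
        def __init__(self, dims, hidden=64, num_classes=2):
            super().__init__()
            # One Bi-LSTM per modality (e.g., language/visual/acoustic),
            # capturing intra-modal interactions.
            self.encoders = nn.ModuleList(
                nn.LSTM(d, hidden, batch_first=True, bidirectional=True)
                for d in dims
            )
            h = 2 * hidden  # Bi-LSTM concatenates both directions
            self.num_modalities = len(dims)
            # Modality gate: one weight per modality at each step (assumed form).
            self.modality_gate = nn.Linear(h * len(dims), len(dims))
            # Temporal gate: one scalar weight per time step (assumed form).
            self.temporal_gate = nn.Linear(h, 1)
            self.classifier = nn.Linear(h, num_classes)

        def forward(self, inputs):
            # inputs: list of (batch, T, dim_m) tensors, one per modality,
            # assumed aligned so all modalities share the same length T.
            encoded = [enc(x)[0] for enc, x in zip(self.encoders, inputs)]
            concat = torch.cat(encoded, dim=-1)                    # (batch, T, h*M)
            # Modality gate: softmax weights over the M modalities per step.
            w = torch.softmax(self.modality_gate(concat), dim=-1)  # (batch, T, M)
            fused = sum(w[..., m:m + 1] * encoded[m]
                        for m in range(self.num_modalities))       # (batch, T, h)
            # Temporal gate: sigmoid weight controls each step's contribution.
            gated = torch.sigmoid(self.temporal_gate(fused)) * fused
            # Max-pooling over time collapses the sequence to one vector.
            pooled, _ = gated.max(dim=1)                           # (batch, h)
            return self.classifier(pooled)

    # Usage with made-up feature sizes for the three modalities:
    model = HGMNSketch(dims=[300, 35, 74], num_classes=2)
    logits = model([torch.randn(8, 20, d) for d in (300, 35, 74)])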



Acknowledgments

This research was partially supported by the Key Project of NSFC (No. 61702149) and two NSFC grants (Nos. 61672366 and 61673290).

Author information

Correspondence to Dong Zhang.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, Q., Wu, L., Xu, Y., Zhang, D., Li, S., Zhou, G. (2019). Hierarchical-Gate Multimodal Network for Human Communication Comprehension. In: Tang, J., Kan, M.Y., Zhao, D., Li, S., Zan, H. (eds.) Natural Language Processing and Chinese Computing. NLPCC 2019. Lecture Notes in Computer Science, vol. 11839. Springer, Cham. https://doi.org/10.1007/978-3-030-32236-6_18

  • DOI: https://doi.org/10.1007/978-3-030-32236-6_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-32235-9

  • Online ISBN: 978-3-030-32236-6

  • eBook Packages: Computer Science, Computer Science (R0)
