Hierarchical-Gate Multimodal Network for Human Communication Comprehension

Conference paper in Natural Language Processing and Chinese Computing (NLPCC 2019), part of the book series Lecture Notes in Computer Science (LNAI, volume 11839).

Abstract

Computational modeling of human multimodal language is an emerging research area in natural language processing that spans the language, visual, and acoustic modalities. Comprehending multimodal language requires modeling not only the interactions within each modality (intra-modal interactions) but, more importantly, the interactions between modalities (cross-modal interactions). In this paper, we present a novel neural architecture for understanding human communication called the Hierarchical-Gate Multimodal Network (HGMN). Specifically, each modality is first encoded by a Bi-LSTM, which captures the intra-modal interactions within that single modality. We then merge the independent per-modality information using two gated layers: the first gate, called the modality gate, computes the weight of each modality, while the second, called the temporal gate, controls the contribution of each time step to the final prediction. Finally, a max-pooling strategy reduces the dimension of the multimodal representation, which is then fed to the prediction layer. We perform extensive comparisons on five publicly available datasets for multimodal sentiment analysis, emotion recognition, and speaker trait recognition. HGMN achieves state-of-the-art performance on all datasets.
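To make the pipeline concrete, below is a minimal PyTorch sketch of the architecture as the abstract describes it: per-modality Bi-LSTM encoding, a modality gate, a temporal gate, and max-pooling before the prediction layer. The class name HGMNSketch, the specific gate formulations (softmax weights over modalities, sigmoid weights over time steps), and the feature sizes in the usage lines are illustrative assumptions, not the paper's exact equations, which are not reproduced on this page.

    import torch
    import torch.nn as nn

    class HGMNSketch(nn.Module):
        # Hypothetical sketch of the pipeline in the abstract: per-modality
        # Bi-LSTM encoders, a modality gate weighting modalities, a temporal
        # gate weighting time steps, then max-pooling over time. The exact
        # gate equations are assumptions, not the authors' formulation.
        def __init__(self, dims, hidden=64, num_classes=2):
            super().__init__()
            # One Bi-LSTM per modality (e.g., language/visual/acoustic),
            # capturing intra-modal interactions.
            self.encoders = nn.ModuleList(
                nn.LSTM(d, hidden, batch_first=True, bidirectional=True)
                for d in dims
            )
            h = 2 * hidden  # Bi-LSTM concatenates both directions
            self.num_modalities = len(dims)
            # Modality gate: one weight per modality at each step (assumed form).
            self.modality_gate = nn.Linear(h * len(dims), len(dims))
            # Temporal gate: one scalar weight per time step (assumed form).
            self.temporal_gate = nn.Linear(h, 1)
            self.classifier = nn.Linear(h, num_classes)

        def forward(self, inputs):
            # inputs: list of (batch, T, dim_m) tensors, one per modality,
            # assumed aligned so all modalities share the same length T.
            encoded = [enc(x)[0] for enc, x in zip(self.encoders, inputs)]
            concat = torch.cat(encoded, dim=-1)                    # (batch, T, h*M)
            # Modality gate: softmax weights over the M modalities per step.
            w = torch.softmax(self.modality_gate(concat), dim=-1)  # (batch, T, M)
            fused = sum(w[..., m:m + 1] * encoded[m]
                        for m in range(self.num_modalities))       # (batch, T, h)
            # Temporal gate: sigmoid weight controls each step's contribution.
            gated = torch.sigmoid(self.temporal_gate(fused)) * fused
            # Max-pooling over time collapses the sequence to one vector.
            pooled, _ = gated.max(dim=1)                           # (batch, h)
            return self.classifier(pooled)

    # Usage with made-up feature sizes for the three modalities:
    model = HGMNSketch(dims=[300, 35, 74], num_classes=2)
    logits = model([torch.randn(8, 20, d) for d in (300, 35, 74)])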



Acknowledgments

This research was partially supported by the Key Project of NSFC (No. 61702149) and two NSFC grants (Nos. 61672366 and 61673290).

Author information

Correspondence to Dong Zhang.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, Q., Wu, L., Xu, Y., Zhang, D., Li, S., Zhou, G. (2019). Hierarchical-Gate Multimodal Network for Human Communication Comprehension. In: Tang, J., Kan, M.Y., Zhao, D., Li, S., Zan, H. (eds.) Natural Language Processing and Chinese Computing. NLPCC 2019. Lecture Notes in Computer Science, vol. 11839. Springer, Cham. https://doi.org/10.1007/978-3-030-32236-6_18

  • DOI: https://doi.org/10.1007/978-3-030-32236-6_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-32235-9

  • Online ISBN: 978-3-030-32236-6

  • eBook Packages: Computer Science, Computer Science (R0)
