Predicting multimodal presentation skills based on instance weighting domain adaptation

Yagi, Yutaro; Okada, Shogo; Shiobara, Shota; Sugimura, Sota

doi:10.1007/s12193-021-00367-x

Original Paper
Published: 18 February 2021

Volume 16, pages 1–16, (2022)
Journal on Multimodal User Interfaces

Yutaro Yagi¹,², Shogo Okada¹ (ORCID: orcid.org/0000-0002-9260-0403), Shota Shiobara² & Sota Sugimura²


Abstract

Presentation skills assessment is one of the central challenges of multimodal modeling. Presentation skills comprise verbal and nonverbal components, but because people demonstrate these skills in many different ways, the observed multimodal features vary widely. Because of this variation, prediction accuracy often degrades when the test samples are drawn from a distribution that differs from that of the training samples. In machine learning theory, this problem, in which the training (source) data are biased, is known as instance selection bias or covariate shift. To address it, this paper presents an instance weighting adaptation method for estimating the presentation skills of each participant from multimodal (verbal and nonverbal) features. For this purpose, we collected a novel multimodal presentation dataset comprising audio signal data, body motion sensor data, and text data of the speech content for participants observed in 58 presentation sessions. The dataset also includes scores for both verbal and nonverbal presentation skills, assessed by two external experts from a human resources department. We extract multimodal features, such as spoken utterances, acoustic features, and the amount of body motion, to estimate the presentation skills. We propose two approaches, early fusion and late fusion, for regression models based on multimodal instance weighting adaptation. The experimental results show that the early fusion regression model with instance weighting adaptation achieved a Pearson correlation of \(\rho =0.39\) as the regression accuracy for the "clarity of presentation goal" element. In the best case, instance weighting adaptation improved the accuracy (correlation coefficient) from \(-0.34\) to \(+0.35\).
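To make the idea of instance weighting under covariate shift concrete, the following is a minimal sketch and not the authors' implementation. Each training sample is weighted by the density ratio w(x) = p_target(x) / p_source(x), and a regression model is fitted by minimising the weighted squared error. Everything specific here is an assumption made for illustration: the one-dimensional toy data, the quadratic ground truth, the two Gaussian input distributions (taken as known, so the ratio can be computed exactly), and all function names. With real multimodal features the ratio would have to be estimated from data, and the abstract does not say which estimator is used.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_weights(xs, source, target):
    """w(x) = p_target(x) / p_source(x); both densities are assumed known."""
    return [gaussian_pdf(x, *target) / gaussian_pdf(x, *source) for x in xs]


def weighted_linear_fit(xs, ys, ws):
    """Closed-form minimiser of sum_i w_i * (y_i - a * x_i - b) ** 2."""
    total = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def pearson(us, vs):
    """Pearson correlation coefficient between two equal-length sequences."""
    mu, mv = sum(us) / len(us), sum(vs) / len(vs)
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    var_u = sum((u - mu) ** 2 for u in us)
    var_v = sum((v - mv) ** 2 for v in vs)
    return cov / math.sqrt(var_u * var_v)


def sample(rng, dist, n):
    """Draw inputs from N(mean, std) and noisy targets from a toy quadratic."""
    xs = [rng.gauss(*dist) for _ in range(n)]
    ys = [x * x + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


rng = random.Random(0)
SOURCE = (0.0, 1.0)  # (mean, std) of the training inputs; hypothetical
TARGET = (1.0, 0.5)  # (mean, std) of the test inputs; hypothetical

x_train, y_train = sample(rng, SOURCE, 400)
x_test, y_test = sample(rng, TARGET, 400)

# Baseline: ordinary least squares, i.e. every training sample has weight 1.
a_u, b_u = weighted_linear_fit(x_train, y_train, [1.0] * len(x_train))
# Adapted: samples that look like test data receive larger weights.
weights = importance_weights(x_train, SOURCE, TARGET)
a_w, b_w = weighted_linear_fit(x_train, y_train, weights)

pred_u = [a_u * x + b_u for x in x_test]
pred_w = [a_w * x + b_w for x in x_test]
mse_u = mean_squared_error(y_test, pred_u)
mse_w = mean_squared_error(y_test, pred_w)
r_w = pearson(y_test, pred_w)

print(f"unweighted: slope={a_u:.2f}, test MSE={mse_u:.3f}")
print(f"weighted:   slope={a_w:.2f}, test MSE={mse_w:.3f}, Pearson r={r_w:.2f}")
```

The linear model is deliberately misspecified for the quadratic target, which is the situation in which weighting matters: the unweighted fit is tuned to the region where training data are dense, while the weighted fit is tuned to the region where test data fall. The target distribution is chosen narrower than the source so that the weights stay bounded; with heavier shifts a few samples can dominate the fit.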


Notes

The spoken content of the presentations includes private information related to the company and the presenters, so the dataset is not publicly available due to privacy policies.
After the program, the lecturers give the attendees feedback comments covering the good points of each presentation and the points to be improved.
https://www.audeering.com/opensmile/.
https://github.com/TadasBaltrusaitis/OpenFace.


Acknowledgements

We appreciate the cooperation of the human resource development department of SoftBank Corp. This work was partially supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Numbers 19H01120 and 19H01719, and by JST AIP Trilateral AI Research, Grant Number JPMJCR20G6, Japan.

Author information

Authors and Affiliations

Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa, 923-1292, Japan
Yutaro Yagi & Shogo Okada
SoftBank Group Corp., Tokyo Shiodome Bldg., 1-9-1 Higashi-shimbashi, Minato-ku, Tokyo, 105-7303, Japan
Yutaro Yagi, Shota Shiobara & Sota Sugimura


Corresponding author

Correspondence to Shogo Okada.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Yutaro Yagi and Shogo Okada contributed equally.


About this article

Cite this article

Yagi, Y., Okada, S., Shiobara, S. et al. Predicting multimodal presentation skills based on instance weighting domain adaptation. J Multimodal User Interfaces 16, 1–16 (2022). https://doi.org/10.1007/s12193-021-00367-x

Received: 15 January 2020
Accepted: 03 February 2021
Published: 18 February 2021
Issue Date: March 2022
DOI: https://doi.org/10.1007/s12193-021-00367-x
