Predicting multimodal presentation skills based on instance weighting domain adaptation


Presentation skills assessment is one of the central challenges of multimodal modeling. Presentation skills are composed of verbal and nonverbal skill components, but because people demonstrate their presentation skills in a variety of manners, the observed multimodal features vary widely. Due to the differences in features, when test data samples are generated on different training data sample distributions, in many cases, the prediction accuracy of the skills degrades. In machine learning theory, this problem in which training (source) data are biased is known as instance selection bias or covariate shift. To solve this problem, this paper presents an instance weighting adaptation method that is applied to estimate the presentation skills of each participant from multimodal (verbal and nonverbal) features. For this purpose, we collect a novel multimodal presentation dataset that includes audio signal data, body motion sensor data, and text data of the speech content for participants observed in 58 presentation sessions. The dataset also includes both verbal and nonverbal presentation skills, which are assessed by two external experts from a human resources department. We extract multimodal features, such as spoken utterances, acoustic features, and the amount of body motion, to estimate the presentation skills. We propose two approaches, early fusing and late fusing, for the regression models based on multimodal instance weighting adaptation. The experimental results show that the early fusing regression model with instance weighting adaptation achieved \(\rho =0.39\) for the Pearson correlation, which presents the regression accuracy for the clarity of presentation goal elements. In the maximum case, the accuracy (correlation coefficient) is improved from \(-0.34\) to +0.35 by instance weighting adaptation.

This is a preview of subscription content, access via your institution.

Fig. 1


  1. 1.

    The spoken content in the presentations include private information related to the company and the presenter, so the data set is not available to the public due to privacy policies.

  2. 2.

    The lecturers provide feedback comments, including the good points in the presentation or points to be improved, to the attendees after the program.

  3. 3.

  4. 4.


  1. 1.

    Aran O, Gatica-Perez D (2013) One of a kind: inferring personality impressions in meetings. In: Proceedings of ACM ICMI, pp 11–18

  2. 2.

    Baltruŝaitis T, Mahmoud M, Robinson P (2015) Cross-dataset learning and person-specific normalisation for automatic action unit detection. In: Proceedings of FG workshops

  3. 3.

    Batrinca L, Mana N, Lepri B, Sebe N, Pianesi F (2016) Multimodal personality recognition in collaborative goal-oriented tasks. IEEE Trans Multimedia 18(4):659–673

    Article  Google Scholar 

  4. 4.

    Berger CR (2003) Chapter 7 “Message Production Skill in Social Interaction”. In: Handbook of communication and social interaction skills. Psychology Press

  5. 5.

    Biel JI, Teijeiro-Mosquera L, Gatica-Perez D (2012) Facetube: predicting personality from facial expressions of emotion in online conversational video. In: Proceedings of ACM ICMI

  6. 6.

    Chen L, Feng G, Joe J, Leong CW, Kitchen C, Lee CM (2014) Towards automated assessment of public speaking skills using multimodal cues. In: Proceedings of ACM ICMI

  7. 7.

    Chollet M, Massachi T, Scherer S (2017) Racing heart and sweaty palms. In: Beskow J, Peters C, Castellano G, O’Sullivan C, Leite I, Kopp S (eds) Intelligent virtual agents. Springer International Publishing

  8. 8.

    Chollet M, Prendinger H, Scherer S (2016) Native versus non-native language fluency implications on multimodal interaction for interpersonal skills training. In: Proceedings of ACM ICMI

  9. 9.

    Chollet M, Scherer S (2017) Assessing public speaking ability from thin slices of behavior. In: Proceedings of IEEE FG

  10. 10.

    Chollet M, Stefanov K, Prendinger H, Scherer S (2015) Public speaking training with a multimodal interactive virtual audience framework. In: Proceedings of ACM ICMI

  11. 11.

    Chollet M, Wörtwein T, Morency LP, Shapiro A, Scherer S (2015) Exploring feedback strategies to improve public speaking: An interactive virtual audience framework. In: Proceedings of ACM UbiComp

  12. 12.

    Greene JO, Burleson BR (2003) Handbook of communication and social interaction skills. Psychology Press

  13. 13.

    Hall JA (1984) Nonverbal sex differences? Communication accuracy and expressive style. Johns Hopkins University Press

  14. 14.

    Hoque ME, Courgeon M, Martin JC, Mutlu B, Picard RW (2013) Mach: my automated conversation coach. In: Proceedings of ACM UbiComp. ACM, pp 697–706

  15. 15.

    Härdle W, Müller M, Sperlich S, Werwatz A (2004) Nonparametric and semiparametric models

  16. 16.

    Ishii R, Otsuka K, Kumano S, Higashinaka R, Tomita J (2018) Analyzing gaze behavior and dialogue act during turn-taking for estimating empathy skill level. In: Proceedings of ACM ICMI

  17. 17.

    Jayagopi DB, Sanchez-Cortes D, Otsuka K, Yamato J, Gatica-Perez D (2012) Linking speaking and looking behavior patterns with group composition, perception, and performance. In: Proceedings of ACM ICMI

  18. 18.

    Kanamori T, Hido S, Sugiyama M (2009) A least-squares approach to direct importance estimation. J Mach Learn Res 10:1391–1445

    MathSciNet  MATH  Google Scholar 

  19. 19.

    Kanamori T, Suzuki T, Sugiyama M (2012) Statistical analysis of kernel-based least-squares density-ratio estimation. Mach Learn 86(3):335–367

    MathSciNet  Article  Google Scholar 

  20. 20.

    Kudo T, Yamamoto K, Matsumoto Y (2004) Applying conditional random fields to Japanese morphological analysis. In: Proceedings of EMNLP

  21. 21.

    Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33(1):159–174

    Article  Google Scholar 

  22. 22.

    Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: Proceedings of ICML

  23. 23.

    LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436

    Article  Google Scholar 

  24. 24.

    Li Y, Kambara H, Koike Y, Sugiyama M (2010) Application of covariate shift adaptation techniques in brain-computer interfaces. IEEE Trans Biomed Eng 57(6):1318–1324

    Article  Google Scholar 

  25. 25.

    Lin YS, Lee CC (2018) Using interlocutor-modulated attention blstm to predict personality traits in small group interaction. In: Proceedings of ACM ICMI

  26. 26.

    Lombard M, Snyder-Duch J, Bracken C (2005) Practical resources for assessing and reporting intercoder reliability in content analysis research projects. Retrieved April 19

  27. 27.

    Mikolov T, Corrado G, Chen K, Dean J (2013) Efficient estimation of word representations in vector space

  28. 28.

    Nguyen L, Frauendorfer D, Mast M, Gatica-Perez D (2014) Hire me: computational inference of hirability in employment interviews based on nonverbal behavior. IEEE Trans Multimedia

  29. 29.

    Okada S, Komatani K (2018) Investigating effectiveness of linguistic features based on speech recognition for storytelling skill assessment. In: Recent trends and future technology in applied intelligence. Springer International Publishing, pp 148–157

  30. 30.

    Okada S, Ohtake Y, Nakano YI, Hayashi Y, Huang HH, Takase Y, Nitta K (2016) Estimating communication skills using dialogue acts and nonverbal features in multiple discussion datasets. In: Proceedings of ACM ICMI

  31. 31.

    Park S, Shim HS, Chatterjee M, Sagae K, Morency LP (2014) Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In: Proceedings of ACM ICMI

  32. 32.

    Pérez-Rosas V, Mihalcea R, Morency LP (2013) Utterance-level multimodal sentiment analysis. In: Proceedings of ACL

  33. 33.

    Pianesi F, Mana N, Cappelletti A, Lepri B, Zancanaro M (2008) Multimodal recognition of personality traits in social interactions. In: Proceedings of ACM ICMI

  34. 34.

    Ramanarayanan V, Leong CW, Chen L, Feng G, Suendermann-Oeft D (2015) Evaluating speech, face, emotion and body movement time-series features for automated multimodal presentation scoring. In: Proceedings of ACM ICMI

  35. 35.

    Rosenberg A, Hirschberg J (2005) Acoustic/prosodic and lexical correlates of charismatic speech. In: Proceedings of INTERSPEECH

  36. 36.

    Sanchez-Cortes D, Aran O, Mast MS, Gatica-Perez D (2012) A nonverbal behavior approach to identify emergent leaders in small groups. IEEE Trans Multimedia 14

  37. 37.

    Scherer S, Weibel N, Morency LP, Oviatt S (2012) Multimodal prediction of expertise and leadership in learning groups. In: Proceedings of the international workshop on MLA

  38. 38.

    Shimodaira H (2000) Improving predictive inference under covariate shift by weighting the log-likelihood function. J Stat Plan Inference 90(2):227–244

    MathSciNet  Article  Google Scholar 

  39. 39.

    Sugiyama M, Kawanabe M (2012) Machine learning in non-stationary environments: introduction to covariate shift adaptation. The MIT Press

  40. 40.

    Sugiyama M, Nakajima S, Kashima H, Buenau PV, Kawanabe M (2008) Direct importance estimation with model selection and its application to covariate shift adaptation. In: Proceedings of advances in neural information processing systems

  41. 41.

    Tanaka H, Negoro H, Iwasaka H, Nakamura S (2018) Listening skills assessment through computer agents. In: Proceedings of ACM ICMI

  42. 42.

    Tanaka H, Sakti S, Neubig G, Toda T, Negoro H, Iwasaka H, Nakamura S (2015) Automated social skills trainer. In: Proceedings of ACM IUI

  43. 43.

    Tsuboi Y, Kashima H, Hido S, Bickel S, Sugiyama M (2009) Direct density ratio estimation for large-scale covariate shift adaptation. J Inf Process 17:138–155

    Google Scholar 

  44. 44.

    Valente F, Kim S, Motlicek P (2012) Annotation and recognition of personality traits in spoken conversations from the ami meetings corpus. In: Proceedings of INTERSPEECH

  45. 45.

    Wood E, Baltruaitis T, Zhang X, Sugano Y, Robinson P, Bulling A (2015) Rendering of eyes for eye-shape registration and gaze estimation. In: Proceedings of IEEE ICCV

  46. 46.

    Wörtwein T, Chollet M, Schauerte B, Morency LP, Stiefelhagen R, Scherer S (2015) Multimodal public speaking performance assessment. In: Proceedings of ACM ICMI

  47. 47.

    Wörtwein T, Morency L, Scherer S (2015) Automatic assessment and analysis of public speaking anxiety: a virtual audience case study. In: Proceedings of ACII

  48. 48.

    Zadrozny B (2004) Learning and evaluating classifiers under sample selection bias. In: Proceedings of ICML

Download references


We appreciate the cooperation of the human resource development department of Softbank Corp. This work was partially supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Numbers 19H01120, 19H01719 and JST AIP Trilateral AI Research, Grant Number JPMJCR20G6, Japan.

Author information



Corresponding author

Correspondence to Shogo Okada.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Yutaro Yagi and Shogo Okada equally contributed.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Yagi, Y., Okada, S., Shiobara, S. et al. Predicting multimodal presentation skills based on instance weighting domain adaptation. J Multimodal User Interfaces (2021).

Download citation


  • Presentation skills
  • Speech production
  • Multimodal
  • Domain adaptation
  • Instance weighting