Tutoring Robots

Multiparty Multimodal Social Dialogue with an Embodied Tutor
  • Samer Al Moubayed
  • Jonas Beskow
  • Bajibabu Bollepalli
  • Ahmed Hussen-Abdelaziz
  • Martin Johansson
  • Maria Koutsombogera
  • José David Lopes
  • Jekaterina Novikova
  • Catharine Oertel
  • Gabriel Skantze
  • Kalin Stefanov
  • Gül Varol
Part of the IFIP Advances in Information and Communication Technology book series (IFIPAICT, volume 425)


This project explores a novel experimental setup for building a spoken, multimodally rich, and human-like multiparty tutoring agent. A setup is developed and a corpus is collected that target the development of a dialogue system platform for exploring verbal and nonverbal tutoring strategies in multiparty spoken interactions with embodied agents. The dialogue task centers on two participants working together to solve a card-ordering game. A tutor sits with the participants, helping them perform the task and organizing and balancing their interaction. Multimodal signals, captured and automatically synchronized by several audio-visual capture technologies, were coupled with manual annotations to build a situated model of the interaction based on the participants' personalities, their temporally changing state of attention, their conversational engagement and verbal dominance, and the way these correlate with the verbal and visual feedback, turn-management, and conversation-regulatory actions generated by the tutor. At the end of this chapter we discuss the research and development opportunities this work opens up and some of the challenges that lie ahead.


Multiparty · Multimodal · Turn-taking · Tutor · Conversational Dominance · Non-verbal Signals · Visual Attention · Spoken Dialogue · Embodied Agent · Social Robot



Copyright information

© IFIP International Federation for Information Processing 2014

Authors and Affiliations

  • Samer Al Moubayed (1)
  • Jonas Beskow (1)
  • Bajibabu Bollepalli (1)
  • Ahmed Hussen-Abdelaziz (5)
  • Martin Johansson (1)
  • Maria Koutsombogera (2)
  • José David Lopes (3)
  • Jekaterina Novikova (4)
  • Catharine Oertel (1)
  • Gabriel Skantze (1)
  • Kalin Stefanov (1)
  • Gül Varol (6)
  1. KTH Speech, Music and Hearing, Sweden
  2. Institute for Language and Speech Processing, "Athena" R.C., Greece
  3. Spoken Language Systems Laboratory, INESC-ID Lisboa, Portugal
  4. Department of Computer Science, University of Bath, UK
  5. Institute of Communication Acoustics, Ruhr-Universität Bochum, Germany
  6. Department of Computer Engineering, Boğaziçi University, Turkey
