Abstract
Advances in artificial intelligence are fundamentally changing how we relate to machines. We once treated computers as tools, but now we expect them to be agents, and increasingly our instinct is to treat them like peers. This paper explores peer-to-peer communication between people and machines. Two ideas are central to the approach: shared perception, in which partners work in a shared environment and much of the information that passes between them is contextual and derived from perception; and visually grounded reasoning, in which an action is considered feasible if it can be visualized or simulated in 3D. We explore both ideas in the context of blocks world, which serves as a surrogate for cooperative tasks in which the partners share a workspace. We begin with elicitation studies, observing pairs of people working together in blocks world and noting the gestures they use; these gestures fall into three categories: social, deictic, and iconic. We then build a prototype system that pairs people with avatars in a simulated blocks world. We find that when participants can see but not hear each other, all three gesture types are necessary, whereas when participants can also speak, the social and deictic gestures remain important while the iconic gestures become less so. We also find that ambiguities flip the conversational lead: the partner previously receiving information takes the lead in order to resolve the ambiguity.
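The feasibility criterion behind visually grounded reasoning can be made concrete with a small example. The sketch below is a minimal illustration under simplified assumptions, not the paper's implementation: it treats a block placement as feasible exactly when the simulated result is collision-free and physically supported. All names here (Block, feasible_place, and the geometry helpers) are hypothetical.

```python
# Minimal sketch of visually grounded feasibility: an action is "feasible"
# iff simulating it in 3D yields a collision-free, supported configuration.
# Blocks are axis-aligned unit cubes; all names are illustrative.
from dataclasses import dataclass

@dataclass
class Block:
    name: str
    x: float
    y: float
    z: float
    size: float = 1.0  # edge length of the cube

def overlaps(a: Block, b: Block) -> bool:
    """Axis-aligned collision test between two cubes (touching != overlap)."""
    half = (a.size + b.size) / 2
    return (abs(a.x - b.x) < half and
            abs(a.y - b.y) < half and
            abs(a.z - b.z) < half)

def supported(block: Block, world: list[Block]) -> bool:
    """True if the block rests on the table (bottom at y == 0)
    or sits directly on top of another block."""
    if abs(block.y - block.size / 2) < 1e-6:
        return True
    return any(abs(block.y - (b.y + (b.size + block.size) / 2)) < 1e-6 and
               abs(block.x - b.x) < b.size / 2 and
               abs(block.z - b.z) < b.size / 2
               for b in world)

def feasible_place(block: Block, target: tuple, world: list[Block]) -> bool:
    """Simulate placing `block` at `target`; feasible iff the moved block
    collides with nothing and the resulting configuration is supported."""
    moved = Block(block.name, *target, block.size)
    others = [b for b in world if b.name != block.name]
    return (not any(overlaps(moved, b) for b in others)
            and supported(moved, others))

# Usage: stacking b3 on top of b1 is feasible in this toy world.
table = [Block("b1", 0, 0.5, 0), Block("b2", 2, 0.5, 0)]
assert feasible_place(Block("b3", 5, 0.5, 5), (0, 1.5, 0), table)
```

In the full system the simulation is richer than this geometric test, but the same gate applies: an instruction is only acted on if a consistent 3D realization of it exists.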
Notes
1. The Kinect v2 estimates the positions of 25 joints, but the 8 lower-body joints are consistently obscured by the table.
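For concreteness, here is a minimal sketch of how the occluded joints might be discarded. The joint names follow the Kinect v2 SDK's JointType enumeration; the skeleton dictionary and the upper_body helper are hypothetical illustrations, not the system's actual code.

```python
# Hypothetical sketch: drop the 8 table-occluded lower-body joints from a
# 25-joint Kinect v2 skeleton frame, keeping the 17 reliably visible ones.
LOWER_BODY = {
    "HipLeft", "KneeLeft", "AnkleLeft", "FootLeft",
    "HipRight", "KneeRight", "AnkleRight", "FootRight",
}

def upper_body(skeleton: dict) -> dict:
    """Filter a frame mapping joint name -> (x, y, z) camera-space position."""
    return {name: pos for name, pos in skeleton.items()
            if name not in LOWER_BODY}
```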
Acknowledgments
This work was supported by the US Defense Advanced Research Projects Agency (DARPA) and the Army Research Office (ARO) under contract #W911NF-15-1-0459 at Colorado State University and the University of Florida and contract #W911NF-15-C-0238 at Brandeis University.
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Narayana, P. et al. (2019). Cooperating with Avatars Through Gesture, Language and Action. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Systems and Applications. IntelliSys 2018. Advances in Intelligent Systems and Computing, vol 868. Springer, Cham. https://doi.org/10.1007/978-3-030-01054-6_20