Two-handed gesture recognition and fusion with speech to command a robot



Assistance is currently a pivotal research area in robotics, with huge societal potential. Since assistant robots interact directly with people, finding natural and easy-to-use user interfaces is of fundamental importance. This paper describes a flexible multimodal interface, based on speech and gesture modalities, for controlling our mobile robot Jido. The vision system uses a stereo head mounted on a pan-tilt unit and a bank of collaborative particle filters dedicated to the upper human body extremities to track and recognize pointing and symbolic gestures, both one-handed and two-handed. This framework constitutes our first contribution: it is shown to properly handle the natural artifacts that arise when performing 3D gestures with either hand or both, namely self-occlusion, hands leaving the camera's field of view, and hand deformation. A speech recognition and understanding system based on the Julius engine is also developed and embedded in order to process deictic and anaphoric utterances. The second contribution is a probabilistic, multi-hypothesis interpreter framework that fuses the results of the speech and gesture components. This interpreter is shown to improve the classification rate of multimodal commands compared to using either modality alone. Finally, we report on successful live experiments in human-centered settings. Results are presented for an interactive manipulation task in which users give Jido local motion commands and perform safe object exchanges.
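The idea of probabilistically fusing the two modalities can be illustrated with a minimal late-fusion sketch: each modality scores a small set of candidate commands, and the joint hypothesis is ranked by the product of the normalized unimodal posteriors. This is only an illustrative sketch under simplified assumptions, not the paper's interpreter; the command names, scores, and `fuse` helper are hypothetical.

```python
# Illustrative late-fusion sketch (NOT the paper's interpreter):
# combine per-command scores from a speech recognizer and a gesture
# recognizer by multiplying their normalized posteriors, then pick
# the best joint hypothesis. All names and numbers are made up.

def normalize(scores):
    """Turn raw per-command scores into a probability distribution."""
    total = sum(scores.values())
    return {cmd: s / total for cmd, s in scores.items()}

def fuse(speech_scores, gesture_scores):
    """Rank multimodal hypotheses by the product of unimodal posteriors."""
    p_speech = normalize(speech_scores)
    p_gesture = normalize(gesture_scores)
    joint = {cmd: p_speech.get(cmd, 0.0) * p_gesture.get(cmd, 0.0)
             for cmd in set(p_speech) | set(p_gesture)}
    best = max(joint, key=joint.get)
    return best, normalize(joint)

# Ambiguous speech ("take this one") plus a clear pointing gesture:
speech = {"take_left_object": 0.45, "take_right_object": 0.40, "stop": 0.15}
gesture = {"take_left_object": 0.10, "take_right_object": 0.85, "stop": 0.05}
best, posterior = fuse(speech, gesture)
print(best)  # the fused evidence favors "take_right_object"
```

The point of the example is the one the abstract makes: speech alone is nearly tied between the two "take" commands, but combining it with the gesture evidence disambiguates the multimodal command.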




Author information

Correspondence to B. Burger.

Electronic Supplementary Material

Below is the link to the electronic supplementary material.

(MP4 3.85 MB)


Cite this article

Burger, B., Ferrané, I., Lerasle, F., et al. (2012). Two-handed gesture recognition and fusion with speech to command a robot. Autonomous Robots, 32, 129–147. doi:10.1007/s10514-011-9263-y



Keywords

  • Human-robot interaction
  • Multiple object tracking
  • Two-handed gesture recognition
  • Vision and speech probabilistic fusion